Merge misaligned pandas dataframes

marcopah

I have around 100 CSV files. Each is read into its own pandas dataframe; the dataframes are merged later on and finally written to a database.

Each CSV file contains 1,000 rows and 816 columns.

Here is the problem:

Each CSV file contains the 816 columns, but not all of the columns contain data. As a result, some of the CSV files are misaligned: the data has been shifted left, but the column itself has not been deleted.

Here's a made-up example:

CSV file A (which is correct):

    Name    Age City
    Joe     18  London
    Kate    19  Berlin
    Math    20  Paris

CSV file B (with misalignment):

    Name    Age    City
    Joe     18     London
    Kate    Berlin
    Math    20     Paris

I would like to merge A and B, but my current solution results in a misalignment.

I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.

The current solution to merge the dataframes is as follows:

def merge_pandas(csvpaths):
    frames = []  # don't shadow the built-in name 'list'
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    # ignore_index avoids duplicate 0..999 indices across files
    return pd.concat(frames, ignore_index=True)

Thanks in advance.

FooBar

A generic solution for these types of problems is most likely overkill. Note that the only possible mistake is a value written into a column to the left of where it belongs.

If your problem is more complex than the two-column example you gave, you should keep an array of the expected column types for convenience:

types = ['string', 'int']

Next, I would set up a marker to identify flaws:

df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1

The script can detect the error with certainty

In your simple scenario, whenever there is an error, we can simply check the value in the first column.

  • If it's a number, ignore and move on (keep the NaN in the second column)
  • If it's a string, move it to the right

In your trivial example, that would be:

def checkRow(row):
    try:
        row['Age'] = int(row['Age'])
    except ValueError:
        # The value belongs one column to the right
        row['City'] = row['Age']
        row['Age'] = np.nan
    return row

df = df.apply(checkRow, axis=1)

In case you have more than two columns, use your types variable to do iterated checks and find out where the NaN belongs.
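Such an iterated check could be sketched as follows, assuming a types list like the one above and that values only ever shift one column to the left (the helper names `matches` and `realign_row` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative helper: does `value` fit the expected type tag?
def matches(value, expected):
    if pd.isnull(value):
        return True
    if expected == 'int':
        try:
            int(value)
            return True
        except (TypeError, ValueError):
            return False
    return isinstance(value, str)

def realign_row(row, columns, types):
    # Walk the columns right to left; when a value does not match its
    # own column's type, the slot to its right is empty, and the value
    # fits there, shift it one position and leave NaN behind.
    for i in range(len(columns) - 2, -1, -1):
        val = row[columns[i]]
        if (not matches(val, types[i])
                and pd.isnull(row[columns[i + 1]])
                and matches(val, types[i + 1])):
            row[columns[i + 1]] = val
            row[columns[i]] = np.nan
    return row

# Example with the misaligned row from file B:
df = pd.DataFrame({'Name': ['Joe', 'Kate'],
                   'Age':  [18, 'Berlin'],
                   'City': ['London', np.nan]})
columns = ['Name', 'Age', 'City']
types = ['string', 'int', 'string']
df = df.apply(realign_row, axis=1, args=(columns, types))
```

After the apply, Kate's "Berlin" sits in City and her Age is NaN, while correct rows are untouched.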

The script cannot know the error with certainty

For example, two adjacent columns may both hold string values. In that case, you're out of luck: use a second marker to flag these rows and fix them manually. You could of course do more advanced checks (the value should be a city name, so check whether it actually is one), but this is probably overkill and doing it manually is faster.
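For instance, a second marker along the lines of the first one could collect the ambiguous rows for manual review (the column names here are made up):

```python
import numpy as np
import pandas as pd

# Two adjacent string columns: a left-shifted surname is itself a
# valid string, so a type check alone cannot prove a misalignment.
df = pd.DataFrame({'First': ['Anna', 'Smith', 'Carl'],
                   'Last':  ['Meyer', np.nan, 'Jones']})

# Flag rows where the second string column is empty; these have to
# be inspected by hand.
df['manual'] = 0
df.loc[df['Last'].isnull(), 'manual'] = 1
```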
