Merge misaligned pandas dataframes

marcopah

I have around 100 CSV files. Each is read into its own pandas dataframe; the dataframes are merged later on and finally written to a database.

Each CSV file contains 1,000 rows and 816 columns.

Here is the problem:

Each CSV file contains the 816 columns, but not all of the columns contain data. As a result, some of the CSV files are misaligned: the data has been shifted left, but the column itself has not been deleted.

Here's a made-up example:

CSV file A (which is correct):

    Name    Age City
    Joe     18  London
    Kate    19  Berlin
    Math    20  Paris

CSV file B (with misalignment):

    Name    Age    City
    Joe     18     London
    Kate    Berlin
    Math    20     Paris

I would like to merge A and B, but my current solution results in a misalignment.

I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.

The current solution to merge the dataframes is as follows:

def merge_pandas(csvpaths):
    frames = []  # don't shadow the built-in name 'list'
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    # ignore_index avoids duplicate 0..999 indices across files
    return pd.concat(frames, ignore_index=True)

Thanks in advance.

FooBar

A generic solution for these types of problems is most likely overkill. Note that the only possible mistake is a value written into a column to the left of where it belongs.

If your problem is more complex than the two-column example you gave, you should keep an array of the expected column types for convenience:

types = ['string', 'int']

Next, I would set up a marker to identify flaws:

df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1

The script can detect the error with certainty

In your simple scenario, whenever there is an error, we can simply check the value in the first column.

  • If it's a number, ignore and move on (keep the NaN in the second column)
  • If it's a string, move it to the right

In your trivial example, that would be:

def checkRow(row):
    try:
        row['Age'] = int(row['Age'])
    except ValueError:
        # The value belongs one column to the right
        row['City'] = row['Age']
        row['Age'] = np.nan
    return row

df = df.apply(checkRow, axis=1)

In case you have more than two columns, use your types variable to do iterated checks and find out where the NaN belongs.
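Such an iterated check could be sketched as follows, assuming a types list like the one above and that values only ever shift one column to the left (the helper names `matches` and `realign_row` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative helper: does `value` fit the expected type tag?
def matches(value, expected):
    if pd.isnull(value):
        return True
    if expected == 'int':
        try:
            int(value)
            return True
        except (TypeError, ValueError):
            return False
    return isinstance(value, str)

def realign_row(row, columns, types):
    # Walk the columns right to left; when a value does not match its
    # own column's type, the slot to its right is empty, and the value
    # fits there, shift it one position and leave NaN behind.
    for i in range(len(columns) - 2, -1, -1):
        val = row[columns[i]]
        if (not matches(val, types[i])
                and pd.isnull(row[columns[i + 1]])
                and matches(val, types[i + 1])):
            row[columns[i + 1]] = val
            row[columns[i]] = np.nan
    return row

# Example with the misaligned row from file B:
df = pd.DataFrame({'Name': ['Joe', 'Kate'],
                   'Age':  [18, 'Berlin'],
                   'City': ['London', np.nan]})
columns = ['Name', 'Age', 'City']
types = ['string', 'int', 'string']
df = df.apply(realign_row, axis=1, args=(columns, types))
```

After the apply, Kate's "Berlin" sits in City and her Age is NaN, while correct rows are untouched.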

The script cannot know the error with certainty

For example, two adjacent columns may both hold string values. In that case, you're out of luck: use a second marker to flag these rows and fix them manually. You could of course do more advanced checks (the value should be a city name, so check whether it actually is one), but this is probably overkill and doing it manually is faster.
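For instance, a second marker along the lines of the first one could collect the ambiguous rows for manual review (the column names here are made up):

```python
import numpy as np
import pandas as pd

# Two adjacent string columns: a left-shifted surname is itself a
# valid string, so a type check alone cannot prove a misalignment.
df = pd.DataFrame({'First': ['Anna', 'Smith', 'Carl'],
                   'Last':  ['Meyer', np.nan, 'Jones']})

# Flag rows where the second string column is empty; these have to
# be inspected by hand.
df['manual'] = 0
df.loc[df['Last'].isnull(), 'manual'] = 1
```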
