How to remove duplicates from a dataframe?

Peter

My current df looks like this:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           NaN
      a            1          x           2008           1

The originaldataindicator is the result of the fact that some of these observations were created to get all three years for each IDnumber, while some existed in the original dataset. What I want to achieve is to drop the duplicates and prefarably keep the original data. Note that the originaldataindicator will not always be the last observation. To solve this I first sort on Idnumber Date Originaldataindicator

However when I use:

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'])

Nothing happens and I still observe the duplicate.

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], inplace=True)

gives me an empty dataframe.

Am I misinterpreting what drop_duplicates does ?

Just to avoid confusion, this is what I want:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           1

The data includes thousands of these ID's

jezrael

I think you need groupby and sort_values and then use parameter keep=first of drop_duplicates:

print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                    NaN

df = df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
              .apply(lambda x: x.sort_values('Originaldataindicator')).reset_index(drop=True)
print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
3        a      1        x  2008                    NaN
4        a      1        x  2008                    NaN

df1=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
print df1
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1

Or use inplace:

df.drop_duplicates(subset=['IDnumber','Subid','Subsubid','Date'], keep='first', inplace=True)
print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1

If column Originaldataindicator have multiple values use duplicated (maybe ther can be add all columns IDnumber,Subid,Subsubid,Date) and isnull:

print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

print df[~((df.duplicated('Date',keep=False))&~(pd.notnull(df['Originaldataindicator'])))]
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

Explaining conditions:

print df.duplicated('Date', keep=False)
0    False
1    False
2     True
3     True
4     True
dtype: bool

print (pd.isnull(df['Originaldataindicator']))
0     True
1     True
2     True
3    False
4    False
Name: Originaldataindicator, dtype: bool

print ~((df.duplicated('Date', keep=False)) & (pd.isnull(df['Originaldataindicator'])))
0     True
1     True
2    False
3     True
4     True
dtype: bool

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

remove duplicates from a dataframe in pyspark

From Dev

Remove similar duplicates from dataframe

From Dev

How to Conditionally Remove Duplicates from Pandas DataFrame with a List

From Dev

how to remove duplicates from a set?

From Dev

How to remove rows with duplicates in pandas dataframe?

From Java

How to update a dataframe, from another dataframe with duplicates

From Dev

How to remove all duplicates from a list?

From Dev

How to "remove duplicates" from a UNION query

From Dev

How to remove duplicates from MongoDB Collection

From Dev

How to remove duplicates from a list of object

From Dev

How to remove duplicates from a list python

From Dev

How to remove duplicates from a parallel array in Java?

From Dev

How to remove duplicates from json in PHP

From Dev

How to remove all duplicates from a list

From Dev

How to remove duplicates from my array of structs?

From Dev

How to "remove duplicates" from a UNION query

From Dev

How to remove duplicates from a parallel array in Java?

From Dev

How to remove duplicates from a list of object

From Dev

How to remove all of the duplicates from the table?

From Dev

How to remove duplicates from Multivalued Fields in SOLR?

From Dev

How do I remove duplicates from an array

From Dev

How remove duplicates from sql list?

From Dev

How to remove duplicates from circular linked list

From Dev

How to remove efficiently all duplicates in dataframe or csv file in python?

From Dev

How to remove efficiently all duplicates in dataframe or csv file in python?

From Dev

Remove Duplicates From BindingList

From Dev

Remove duplicates from column

From Dev

Remove Duplicates from MongoDB

From Dev

Remove duplicates from select

Related Related

HotTag

Archive