How to remove duplicates from a dataframe?

Peter Published at Dev

Peter

My current df looks like this:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           NaN
      a            1          x           2008           1

The originaldataindicator is the result of the fact that some of these observations were created to get all three years for each IDnumber, while some existed in the original dataset. What I want to achieve is to drop the duplicates and prefarably keep the original data. Note that the originaldataindicator will not always be the last observation. To solve this I first sort on Idnumber Date Originaldataindicator

However when I use:

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'])

Nothing happens and I still observe the duplicate.

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], inplace=True)

gives me an empty dataframe.

Am I misinterpreting what drop_duplicates does ?

Just to avoid confusion, this is what I want:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           1

The data includes thousands of these ID's

jezrael

I think you need groupby and sort_values and then use parameter keep=first of drop_duplicates:

print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                    NaN

df = df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
              .apply(lambda x: x.sort_values('Originaldataindicator')).reset_index(drop=True)
print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
3        a      1        x  2008                    NaN
4        a      1        x  2008                    NaN

df1=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
print df1
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1

Or use inplace:

df.drop_duplicates(subset=['IDnumber','Subid','Subsubid','Date'], keep='first', inplace=True)
print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1

If column Originaldataindicator have multiple values use duplicated (maybe ther can be add all columns IDnumber,Subid,Subsubid,Date) and isnull:

print df
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

print df[~((df.duplicated('Date',keep=False))&~(pd.notnull(df['Originaldataindicator'])))]
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

Explaining conditions:

print df.duplicated('Date', keep=False)
0    False
1    False
2     True
3     True
4     True
dtype: bool

print (pd.isnull(df['Originaldataindicator']))
0     True
1     True
2     True
3    False
4    False
Name: Originaldataindicator, dtype: bool

print ~((df.duplicated('Date', keep=False)) & (pd.isnull(df['Originaldataindicator'])))
0     True
1     True
2    False
3     True
4     True
dtype: bool

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-02-25

Comments

0 comments

From Dev

Related Related

Article

How to remove duplicates from a dataframe?

How to remove duplicates from a dataframe?

remove duplicates from a dataframe in pyspark

Remove similar duplicates from dataframe

How to Conditionally Remove Duplicates from Pandas DataFrame with a List

how to remove duplicates from a set?

How to remove rows with duplicates in pandas dataframe?

How to update a dataframe, from another dataframe with duplicates

How to remove all duplicates from a list?

How to "remove duplicates" from a UNION query

How to remove duplicates from MongoDB Collection

How to remove duplicates from a list of object

How to remove duplicates from a list python

How to remove duplicates from a parallel array in Java?

How to remove duplicates from json in PHP

How to remove all duplicates from a list

How to remove duplicates from my array of structs?

How to "remove duplicates" from a UNION query

How to remove duplicates from a parallel array in Java?

How to remove duplicates from a list of object

How to remove all of the duplicates from the table?

How to remove duplicates from Multivalued Fields in SOLR?

How do I remove duplicates from an array

How remove duplicates from sql list?

How to remove duplicates from circular linked list

How to remove efficiently all duplicates in dataframe or csv file in python?

How to remove efficiently all duplicates in dataframe or csv file in python?

Remove Duplicates From BindingList

Remove duplicates from column

Remove Duplicates from MongoDB

Remove duplicates from select