Deleting rows from Pandas dataframe based on groupby values

user1718097

I have a large Pandas dataframe (> 1 million rows) that I have retrieved from a SQL Server database. In a small number of cases, some of the records have duplicate entries. All cells are identical except for a single, text field. It looks as though the record has been entered into the database and then, at a later time, additional text has been added to the field and the record stored in the database as a separate entry. So basically, I want to keep only the record with the longest text string. A simplified version of the database can be created as follows:

tempDF = pd.DataFrame({ 'recordID': [1,2,3,3,4,5,6,6,6,7,7,8,9,10],
                        'text': ['abc', 'def', 'ghi', 'ghijkl', 'mto', 'per', 'st', 'stuvw', 'stuvwx', 'yz', 'yzab', 'cde', 'fgh', 'ijk']})

Which looks like this:

    recordID    text
0         21     abc
1         22     def
2         23     ghi
3         23  ghijkl
4         24     mno
5         25     pqr
6         26      st
7         26   stuvw
8         26  stuvwx
9         27      yz
10        27    yzab
11        28     cde
12        29     fgh
13        30     ijk

So far, I've identified the rows with duplicate recordID and calculated the length of the text field:

tempDF['dupl'] = tempDF.duplicated(subset = 'recordID',keep=False)
tempDF['texLen'] = tempDF['text'].str.len()
print(tempDF)

To produce:

    recordID    text   dupl  texLen
0         21     abc  False       3
1         22     def  False       3
2         23     ghi   True       3
3         23  ghijkl   True       6
4         24     mno  False       3
5         25     pqr  False       3
6         26      st   True       2
7         26   stuvw   True       5
8         26  stuvwx   True       6
9         27      yz   True       2
10        27    yzab   True       4
11        28     cde  False       3
12        29     fgh  False       3
13        30     ijk  False       3

I can groupby all the dupl==True records based on recordID using:

tempGrouped = tempDF[tempDF['dupl']==True].groupby('recordID')

And print off each group separately:

for name, group in tempGrouped:
    print('n',name)
    print(group)

23
   recordID    text  dupl  texLen
2        23     ghi  True       3
3        23  ghijkl  True       6

26
   recordID    text  dupl  texLen
6        26      st  True       2
7        26   stuvw  True       5
8        26  stuvwx  True       6

27
    recordID  text  dupl  texLen
9         27    yz  True       2
10        27  yzab  True       4

I want the final dataframe to consist of those records where dupl==False and, if dupl==True then only the replicate with the longest text field should be retained. So, the final dataframe should look like:

    recordID    text   dupl  texLen
0         21     abc  False       3
1         22     def  False       3
3         23  ghijkl   True       6
4         24     mno  False       3
5         25     pqr  False       3
8         26  stuvwx   True       6
10        27    yzab   True       4
11        28     cde  False       3
12        29     fgh  False       3
13        30     ijk  False       3

How can I delete from the original dataframe only those rows where recordID is duplicated and where texLen is less than the maximum?

jezrael

You can try find indexes with max values by idxmax, concat with False values in dupl column and last sort_index:

idx = tempDF[tempDF['dupl']==True].groupby('recordID')['texLen'].idxmax()   

print tempDF.loc[idx]
    recordID    text  dupl  texLen
3         23  ghijkl  True       6
8         26  stuvwx  True       6
10        27    yzab  True       4

print pd.concat([tempDF[tempDF['dupl']==False], tempDF.loc[idx]]).sort_index(0)
    recordID    text   dupl  texLen
0         21     abc  False       3
1         22     def  False       3
3         23  ghijkl   True       6
4         24     mto  False       3
5         25     per  False       3
8         26  stuvwx   True       6
10        27    yzab   True       4
11        28     cde  False       3
12        29     fgh  False       3
13        30     ijk  False       3

The simplier solution use sort_values and first, because rows with False have unique recordID (are NOT duplicated):

df=tempDF.sort_values(by="texLen", ascending=False).groupby("recordID").first().reset_index()
print df   
   recordID    text   dupl  texLen
0        21     abc  False       3
1        22     def  False       3
2        23  ghijkl   True       6
3        24     mto  False       3
4        25     per  False       3
5        26  stuvwx   True       6
6        27    yzab   True       4
7        28     cde  False       3
8        29     fgh  False       3
9        30     ijk  False       3

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

Select CONSECUTIVE rows from a DataFrame based on values in a column in Pandas with Groupby

From Dev

Deleting DataFrame rows in Pandas based on column value - multiple values to remove

From Dev

Deleting Specific Rows from Pandas Dataframe

From Dev

Drop pandas dataframe rows based on groupby() condition

From Dev

Selecting rows from a Dataframe based on values in multiple columns in pandas

From Dev

Select rows from a DataFrame based on multiple values in a column in pandas

From Dev

Select rows from a DataFrame based on last characters of values in a column in pandas

From Java

Update rows in Pandas Dataframe based on the list values

From Dev

Insert rows based on values pandas dataframe

From Dev

Groupby Pandas dataframe and drop values conditionally based on rank

From Dev

How to create new rows from values inside in a cloumn of pandas dataframe based on delimeter in Python?

From Dev

Filtering rows from dataframe based on the values of the previous rows

From Dev

Deleting Rows Based on Multiple Cell Values

From Java

Select rows of pandas dataframe based on column values with duplicates

From Dev

Pandas: Change dataframe values based on dictionary and remove rows with no match

From Dev

selecting rows based on multiple column values in pandas dataframe

From Dev

How to assign unique values to groups of rows in a pandas dataframe based on a condition?

From Dev

Pandas dataframe, select n random rows based on number of unique values

From Dev

Pandas - Python, deleting rows based on Date column

From Dev

Deleting rows based on multiple conditions Python Pandas

From Dev

Split pandas dataframe based on groupby

From Dev

How generate all pairs of values, from the result of a groupby, in a pandas dataframe

From Java

Deleting DataFrame row in Pandas based on column value

From Dev

Deleting a row in pandas dataframe based on condition

From Dev

Assign values to columns in Pandas Dataframe based on data from another dataframe

From Dev

Overwriting values in a pandas dataframe based on NA values from a second one

From Java

How to select rows from a DataFrame based on column values

From Dev

Select rows from grouped dataframe based on duplicate values

From Dev

Deleting a row based values of of a column using pandas

Related Related

  1. 1

    Select CONSECUTIVE rows from a DataFrame based on values in a column in Pandas with Groupby

  2. 2

    Deleting DataFrame rows in Pandas based on column value - multiple values to remove

  3. 3

    Deleting Specific Rows from Pandas Dataframe

  4. 4

    Drop pandas dataframe rows based on groupby() condition

  5. 5

    Selecting rows from a Dataframe based on values in multiple columns in pandas

  6. 6

    Select rows from a DataFrame based on multiple values in a column in pandas

  7. 7

    Select rows from a DataFrame based on last characters of values in a column in pandas

  8. 8

    Update rows in Pandas Dataframe based on the list values

  9. 9

    Insert rows based on values pandas dataframe

  10. 10

    Groupby Pandas dataframe and drop values conditionally based on rank

  11. 11

    How to create new rows from values inside in a cloumn of pandas dataframe based on delimeter in Python?

  12. 12

    Filtering rows from dataframe based on the values of the previous rows

  13. 13

    Deleting Rows Based on Multiple Cell Values

  14. 14

    Select rows of pandas dataframe based on column values with duplicates

  15. 15

    Pandas: Change dataframe values based on dictionary and remove rows with no match

  16. 16

    selecting rows based on multiple column values in pandas dataframe

  17. 17

    How to assign unique values to groups of rows in a pandas dataframe based on a condition?

  18. 18

    Pandas dataframe, select n random rows based on number of unique values

  19. 19

    Pandas - Python, deleting rows based on Date column

  20. 20

    Deleting rows based on multiple conditions Python Pandas

  21. 21

    Split pandas dataframe based on groupby

  22. 22

    How generate all pairs of values, from the result of a groupby, in a pandas dataframe

  23. 23

    Deleting DataFrame row in Pandas based on column value

  24. 24

    Deleting a row in pandas dataframe based on condition

  25. 25

    Assign values to columns in Pandas Dataframe based on data from another dataframe

  26. 26

    Overwriting values in a pandas dataframe based on NA values from a second one

  27. 27

    How to select rows from a DataFrame based on column values

  28. 28

    Select rows from grouped dataframe based on duplicate values

  29. 29

    Deleting a row based values of of a column using pandas

HotTag

Archive