Deleting rows from Pandas dataframe based on groupby values

user1718097 Published at Dev

user1718097

I have a large Pandas dataframe (> 1 million rows) that I have retrieved from a SQL Server database. In a small number of cases, some of the records have duplicate entries. All cells are identical except for a single, text field. It looks as though the record has been entered into the database and then, at a later time, additional text has been added to the field and the record stored in the database as a separate entry. So basically, I want to keep only the record with the longest text string. A simplified version of the database can be created as follows:

import pandas as pd

tempDF = pd.DataFrame({'recordID': [21, 22, 23, 23, 24, 25, 26, 26, 26, 27, 27, 28, 29, 30],
                       'text': ['abc', 'def', 'ghi', 'ghijkl', 'mno', 'pqr', 'st', 'stuvw', 'stuvwx', 'yz', 'yzab', 'cde', 'fgh', 'ijk']})

Which looks like this:

    recordID    text
0         21     abc
1         22     def
2         23     ghi
3         23  ghijkl
4         24     mno
5         25     pqr
6         26      st
7         26   stuvw
8         26  stuvwx
9         27      yz
10        27    yzab
11        28     cde
12        29     fgh
13        30     ijk

So far, I've identified the rows with duplicate recordID and calculated the length of the text field:

tempDF['dupl'] = tempDF.duplicated(subset='recordID', keep=False)
tempDF['texLen'] = tempDF['text'].str.len()
print(tempDF)

To produce:

    recordID    text   dupl  texLen
0         21     abc  False       3
1         22     def  False       3
2         23     ghi   True       3
3         23  ghijkl   True       6
4         24     mno  False       3
5         25     pqr  False       3
6         26      st   True       2
7         26   stuvw   True       5
8         26  stuvwx   True       6
9         27      yz   True       2
10        27    yzab   True       4
11        28     cde  False       3
12        29     fgh  False       3
13        30     ijk  False       3

I can groupby all the dupl==True records based on recordID using:

tempGrouped = tempDF[tempDF['dupl']==True].groupby('recordID')

And print off each group separately:

for name, group in tempGrouped:
    print('\n', name)
    print(group)

23
   recordID    text  dupl  texLen
2        23     ghi  True       3
3        23  ghijkl  True       6

26
   recordID    text  dupl  texLen
6        26      st  True       2
7        26   stuvw  True       5
8        26  stuvwx  True       6

27
    recordID  text  dupl  texLen
9         27    yz  True       2
10        27  yzab  True       4

I want the final dataframe to consist of those records where dupl==False and, if dupl==True then only the replicate with the longest text field should be retained. So, the final dataframe should look like:

    recordID    text   dupl  texLen
0         21     abc  False       3
1         22     def  False       3
3         23  ghijkl   True       6
4         24     mno  False       3
5         25     pqr  False       3
8         26  stuvwx   True       6
10        27    yzab   True       4
11        28     cde  False       3
12        29     fgh  False       3
13        30     ijk  False       3

How can I delete from the original dataframe only those rows where recordID is duplicated and where texLen is less than the maximum?

jezrael

You can find the index labels of the rows with the maximum texLen in each group with idxmax, concat those rows with the rows where dupl is False, and finally restore the original order with sort_index:

idx = tempDF[tempDF['dupl']==True].groupby('recordID')['texLen'].idxmax()   

print(tempDF.loc[idx])
    recordID    text  dupl  texLen
3         23  ghijkl  True       6
8         26  stuvwx  True       6
10        27    yzab  True       4

print(pd.concat([tempDF[tempDF['dupl']==False], tempDF.loc[idx]]).sort_index())
    recordID    text   dupl  texLen
0         21     abc  False       3
1         22     def  False       3
3         23  ghijkl   True       6
4         24     mno  False       3
5         25     pqr  False       3
8         26  stuvwx   True       6
10        27    yzab   True       4
11        28     cde  False       3
12        29     fgh  False       3
13        30     ijk  False       3
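
The split on dupl is not strictly required: a recordID that occurs only once forms a group of one, so idxmax just returns that row's own label. A minimal sketch of that idea, assuming the sample data from the question and no missing text values (the names sample, text_len and longest are illustrative, not from the original post):

```python
import pandas as pd

# Sample data as shown in the question's printed dataframe.
sample = pd.DataFrame({
    'recordID': [21, 22, 23, 23, 24, 25, 26, 26, 26, 27, 27, 28, 29, 30],
    'text': ['abc', 'def', 'ghi', 'ghijkl', 'mno', 'pqr', 'st', 'stuvw',
             'stuvwx', 'yz', 'yzab', 'cde', 'fgh', 'ijk'],
})

# Length of each text value, kept as a separate Series instead of a helper column.
text_len = sample['text'].str.len()

# For every recordID, the index label of the row with the longest text.
# A recordID that appears once yields the label of its only row.
idx = text_len.groupby(sample['recordID']).idxmax()

# Select those rows and restore the original row order.
longest = sample.loc[idx].sort_index()
print(longest)
```

This keeps the original index labels and avoids adding the dupl and texLen columns to the dataframe.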

A simpler solution uses sort_values and first. It works on the whole dataframe because rows where dupl is False have a unique recordID, so each of them forms a group of one:

df = tempDF.sort_values(by='texLen', ascending=False).groupby('recordID').first().reset_index()
print(df)
   recordID    text   dupl  texLen
0        21     abc  False       3
1        22     def  False       3
2        23  ghijkl   True       6
3        24     mno  False       3
4        25     pqr  False       3
5        26  stuvwx   True       6
6        27    yzab   True       4
7        28     cde  False       3
8        29     fgh  False       3
9        30     ijk  False       3
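
Note that groupby(...).first() takes the first non-null value of each column separately and replaces the original index with a new one. If the original row labels should survive, as in the desired output in the question, one possible variant (a sketch, not part of the answer above) is to sort by text length and then call drop_duplicates on recordID. The names sample and result are illustrative. The sample data has no ties for the longest text, and this sketch does not define which row wins if there is one:

```python
import pandas as pd

# Sample data as shown in the question's printed dataframe.
sample = pd.DataFrame({
    'recordID': [21, 22, 23, 23, 24, 25, 26, 26, 26, 27, 27, 28, 29, 30],
    'text': ['abc', 'def', 'ghi', 'ghijkl', 'mno', 'pqr', 'st', 'stuvw',
             'stuvwx', 'yz', 'yzab', 'cde', 'fgh', 'ijk'],
})
sample['texLen'] = sample['text'].str.len()

result = (
    sample.sort_values('texLen', ascending=False)       # longest text first
          .drop_duplicates(subset='recordID', keep='first')  # one whole row per recordID
          .sort_index()                                   # back to the original row order
)
print(result)
```

Because drop_duplicates keeps whole rows, every column in the result comes from the same original record.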


Edited at 2021-02-26
