Remove Quasi Duplicates In Pandas

debugcn Published at Dev

Jeff Saltfist

I have a Pandas data frame that looks as follows:

import pandas as pd
data = pd.read_csv('C:\Users\Frank\Desktop\\10-25-16-54-7-IMPORT.csv', index_col=False)
print data.head(10)

   Date                                 Symbol   
0  2015-03-18 01:54:35 UTC              NKTR             -0.290   
1  2015-03-18 02:10:49 UTC               DRQ             -0.082   
2  2015-03-18 03:03:10 UTC              NKTR             -0.290   
3  2015-03-18 03:13:17 UTC               UAM              0.414   
4  2015-03-18 03:48:24 UTC              ROCK              0.000   
5  2015-03-18 03:56:30 UTC              ROCK              0.000   
6  2015-03-18 04:52:24 UTC               MTZ             -0.290   
7  2015-03-18 05:00:29 UTC              NKTR             -0.290   
8  2015-03-18 05:04:31 UTC              NKTR             -0.290   
9  2015-03-18 05:29:48 UTC              PSEC             -0.046

I want to remove every row with a duplicate symbol (in this case "NKTR") that occurs subsequent to the first instance of that same symbol on that same day. Is this possible?

(removing duplicates will not work because of the different time stamp of the rows).

Psidom

You can try groupby() the date of the Date column and Symbol then take the first row of each group:

import pandas as pd
df.groupby([pd.to_datetime(df.Date).dt.date, 'Symbol'], as_index=False).first()

#  Symbol                      Date  Value
#0    DRQ   2015-03-18 02:10:49 UTC -0.082
#1    MTZ   2015-03-18 04:52:24 UTC -0.290
#2   NKTR   2015-03-18 01:54:35 UTC -0.290
#3   PSEC   2015-03-18 05:29:48 UTC -0.046
#4   ROCK   2015-03-18 03:48:24 UTC  0.000
#5    UAM   2015-03-18 03:13:17 UTC  0.414

Collected from the Internet

Please contact [email protected] to delete if infringement.