How to remove duplicates in pandas?

Behzat

I have lots of data in excel files. I would like to concatenate these datas into one excel file by removing duplicate records according to id column information.

df1 
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998


df2 
   id    name   date 
0  11    den    2012
1  14    ces    2011 
2   4    guk    2007

I want to have below concantenated file finally.

Concat df
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998
1  14    ces    2011 
2   4    guk    2007

I try below but it does not remove duplicates. Can anyone advise how to fix this ?

pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)

My concatenated data are as below. Duplicated ids are still on the file.

    id                  created_at          retweet_count
0   721557296757797000  2016-04-17 04:34:00 21
1   721497712726844000  2016-04-17 00:37:14 94
2   721462059515453000  2016-04-16 22:15:33 0
3   721460623285072000  2016-04-16 22:09:51 0
4   721460397241446000  2016-04-16 22:08:57 0
5   721459817651577000  2016-04-16 22:06:39 0
6   721456334894469000  2016-04-16 21:52:48 0
7   721557296757797000  2016-04-17 04:34:00 21
8   721497712726844000  2016-04-17 00:37:14 94
9   721462059515453000  2016-04-16 22:15:33 0
10  721460623285072000  2016-04-16 22:09:51 0
11  721460397241446000  2016-04-16 22:08:57 0
12  721459817651577000  2016-04-16 22:06:39 0
13  721456334894469000  2016-04-16 21:52:48 0
jezrael

I think you need add parameter subset to drop_duplicates for filtering by column id:

print pd.concat([df1,df2]).drop_duplicates(subset='id').reset_index(drop=True)
   id name  date
0   1  cab  2017
1  11  den  2012
2  13  ers  1998
3  14  ces  2011
4   4  guk  2007

EDIT:

I try your new data and for me it works:

import pandas as pd

df = pd.DataFrame({'created_at': {0: '2016-04-17 04:34:00', 1: '2016-04-17 00:37:14', 2: '2016-04-16 22:15:33', 3: '2016-04-16 22:09:51', 4: '2016-04-16 22:08:57', 5: '2016-04-16 22:06:39', 6: '2016-04-16 21:52:48', 7: '2016-04-17 04:34:00', 8: '2016-04-17 00:37:14', 9: '2016-04-16 22:15:33', 10: '2016-04-16 22:09:51', 11: '2016-04-16 22:08:57', 12: '2016-04-16 22:06:39', 13: '2016-04-16 21:52:48'}, 'retweet_count': {0: 21, 1: 94, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 21, 8: 94, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0}, 'id': {0: 721557296757797000, 1: 721497712726844000, 2: 721462059515453000, 3: 721460623285072000, 4: 721460397241446000, 5: 721459817651577000, 6: 721456334894469000, 7: 721557296757797000, 8: 721497712726844000, 9: 721462059515453000, 10: 721460623285072000, 11: 721460397241446000, 12: 721459817651577000, 13: 721456334894469000}},
                  columns=['id','created_at','retweet_count'])
print df
                    id           created_at  retweet_count
0   721557296757797000  2016-04-17 04:34:00             21
1   721497712726844000  2016-04-17 00:37:14             94
2   721462059515453000  2016-04-16 22:15:33              0
3   721460623285072000  2016-04-16 22:09:51              0
4   721460397241446000  2016-04-16 22:08:57              0
5   721459817651577000  2016-04-16 22:06:39              0
6   721456334894469000  2016-04-16 21:52:48              0
7   721557296757797000  2016-04-17 04:34:00             21
8   721497712726844000  2016-04-17 00:37:14             94
9   721462059515453000  2016-04-16 22:15:33              0
10  721460623285072000  2016-04-16 22:09:51              0
11  721460397241446000  2016-04-16 22:08:57              0
12  721459817651577000  2016-04-16 22:06:39              0
13  721456334894469000  2016-04-16 21:52:48              0

print df.dtypes

id                int64
created_at       object
retweet_count     int64
dtype: object


print df.drop_duplicates(subset='id').reset_index(drop=True)
                   id           created_at  retweet_count
0  721557296757797000  2016-04-17 04:34:00             21
1  721497712726844000  2016-04-17 00:37:14             94
2  721462059515453000  2016-04-16 22:15:33              0
3  721460623285072000  2016-04-16 22:09:51              0
4  721460397241446000  2016-04-16 22:08:57              0
5  721459817651577000  2016-04-16 22:06:39              0
6  721456334894469000  2016-04-16 21:52:48              0

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

How to remove duplicates in pandas?

From Dev

How to remove rows with duplicates in pandas dataframe?

From Dev

Remove Quasi Duplicates In Pandas

From Dev

How to Conditionally Remove Duplicates from Pandas DataFrame with a List

From Dev

How to remove duplicates in ListBox?

From Dev

How to remove duplicates

From Dev

How to remove duplicates

From Dev

How to remove command duplicates?

From Dev

How to remove duplicates in a document?

From Dev

Powershell : How to Remove Duplicates

From Dev

How to remove duplicates from a dataframe?

From Java

How to remove Duplicates in .txt file

From Dev

how to remove duplicates from a set?

From Dev

How to remove duplicates in a complicated JOIN

From Dev

How to remove duplicates in SQL Server

From Dev

How to remove duplicates and originals in excel

From Dev

Combine rows to remove duplicates in CSV Python and Pandas

From Dev

How to find duplicates in pandas dataframe

From Dev

How to flatten a list of tuples and remove the duplicates?

From Dev

How to remove all duplicates from a list?

From Dev

How to find & remove duplicates in data frames?

From Dev

How to "remove duplicates" from a UNION query

From Dev

How to remove BOTH duplicates in MS ACCESS

From Dev

How to remove duplicates by column (inverse ordering)

From Dev

How to combine multiple querysets and remove duplicates?

From Dev

C# How to filter a list and remove duplicates?

From Dev

angularjs how to remove duplicates by multiple select

From Dev

How to remove cross duplicates at query level

From Dev

How to remove duplicates from MongoDB Collection