Pandas: Delete rows of a DataFrame if a value in a particular column occurs only once

Levine Published at Dev

Levine

I want to delete the rows of a DataFrame whose value in a particular column occurs only once in that column.

Example of raw table (values are arbitrary for illustrative purposes):

print(df)

     Country     Series          Value
0    Bolivia     Population      123
1    Kenya       Population      1234
2    Ukraine     Population      12345
3    US          Population      123456
5    Bolivia     GDP             23456
6    Kenya       GDP             234567
7    Ukraine     GDP             2345678
8    US          GDP             23456789
9    Bolivia     #McDonalds      3456
10   Kenya       #Schools        3455
11   Ukraine     #Cars           3456
12   US          #Tshirts        3456789

Intended outcome:

print(df)

     Country     Series          Value
0    Bolivia     Population      123
1    Kenya       Population      1234
2    Ukraine     Population      12345
3    US          Population      123456
5    Bolivia     GDP             23456
6    Kenya       GDP             234567
7    Ukraine     GDP             2345678
8    US          GDP             23456789

I know that df.Series.value_counts()>1 will identify which values of df.Series occur more than once, and that the output will look something like the following:

     Population     True
     GDP            True
     #McDonalds    False
     #Schools      False
     #Cars         False
     #Tshirts      False

I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only once, but this doesn't work: df.drop(df.Series.value_counts()==1,axis=1,inplace=True)

Gustavo Bezerra

You can do this by building a boolean list or array, either with a list comprehension or with the string methods that pandas exposes on a column through the .str accessor.

The list comprehension approach is:

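vc = df['Series'].value_counts()
singles = set(vc[vc == 1].index)              # values that occur exactly once (built once, not per row)
u  = [i not in singles for i in df['Series']] # True for rows to keep
df = df[u]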
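The same boolean mask can also be built without a Python-level loop. The following is a minimal sketch, not part of the original answer: it uses `Series.isin` on the labels that `value_counts` reports more than once. The helper name `drop_singletons` and the rebuilt sample frame are assumptions made for illustration (the sample uses a default 0..11 index rather than the index shown in the question).

```python
import pandas as pd

# Sample data mirroring the question's table (values are arbitrary).
df = pd.DataFrame({
    "Country": ["Bolivia", "Kenya", "Ukraine", "US"] * 3,
    "Series": ["Population"] * 4 + ["GDP"] * 4
              + ["#McDonalds", "#Schools", "#Cars", "#Tshirts"],
    "Value": [123, 1234, 12345, 123456,
              23456, 234567, 2345678, 23456789,
              3456, 3455, 3456, 3456789],
})


def drop_singletons(frame, column):
    """Hypothetical helper: keep rows whose value in `column` occurs more than once."""
    counts = frame[column].value_counts()
    repeated = counts[counts > 1].index       # labels seen at least twice
    return frame[frame[column].isin(repeated)]


filtered = drop_singletons(df, "Series")

# An equivalent mask, built from the size of each group instead:
alt_mask = df.groupby("Series")["Series"].transform("size") > 1
```

Because `isin` compares whole values, this variant needs no regex escaping and does not match substrings.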

The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (a regular expression is used here because we are matching several strings at once):

vc  = df['Series'].value_counts()
pat = r'|'.join(vc[vc == 1].index)          # regular expression: value1|value2|...
df  = df[~df['Series'].str.contains(pat)]   # tilde negates the boolean mask

The regular-expression approach is a bit more hackish and may require extra processing of pat (escaping characters, etc.) if the strings you want to filter out contain regex metacharacters, which requires some basic regex knowledge. Also note that str.contains matches substrings, so a value that occurs once can also remove rows whose values merely contain it. However, this approach was about 4x faster than the list comprehension when tested on the data provided in the question.
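The extra processing mentioned above could look roughly like the sketch below, which is an assumption-laden illustration rather than the answerer's code. It escapes each value with `re.escape`, uses `Series.str.fullmatch` (available in pandas 1.1 and later) so that only whole values match, and skips filtering when nothing occurs exactly once, since an empty pattern would match every row. The sample values are invented to show the pitfalls: unescaped, the dot in "a.c" would act as a wildcard and also match "abc".

```python
import re

import pandas as pd

# Invented sample: two of the values contain regex metacharacters.
df = pd.DataFrame({
    "Series": ["GDP", "GDP", "GDP(PPP)", "a.c", "abc", "abc"],
    "Value": [1, 2, 3, 4, 5, 6],
})

vc = df["Series"].value_counts()
singles = vc[vc == 1].index                    # "GDP(PPP)" and "a.c"

if len(singles) > 0:
    # Escape every value so it is matched literally, then group the alternatives.
    pat = "(?:" + "|".join(re.escape(s) for s in singles) + ")"
    mask = df["Series"].str.fullmatch(pat)     # True only for whole-value matches
    result = df[~mask]
else:
    # No value occurs exactly once, so there is nothing to drop.
    result = df
```

Whether this keeps the speed advantage reported above has not been measured here.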

As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.


Edited at 2021-02-25
