I have a data frame. For every possible combination of values in the first two columns, I would like to delete the rows in which that combination occurs fewer than 100 times.
For example, suppose there are 5 rows that have "A" in the first column and "B" in the second. I would like to delete all of these rows. Suppose there are 110 rows in which the first and second columns contain "C" and "D", respectively. I do not want to delete these rows, since 110 > 100.
What is the most elegant and fast way to do this?
This is the solution that I have at the moment:
gr = df.groupby(['L_ID', 'P_ID'])
for group in gr.groups:
    df_tmp = gr.get_group(group)
    n_vals = len(df_tmp)
    if n_vals < min_n:  # min_n = 100
        df = df[(df['L_ID'] != group[0]) | (df['P_ID'] != group[1])]
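To make the loop concrete, here is a minimal runnable version. The column names `L_ID` and `P_ID` come from the question; the sample data and the threshold `min_n = 3` are made up for illustration (the question uses 100):

```python
import pandas as pd

# Toy data: the ('A', 'B') group has 2 rows, the ('C', 'D') group has 3
df = pd.DataFrame({'L_ID': list('AACCC'),
                   'P_ID': list('BBDDD'),
                   'val': range(5)})
min_n = 3  # illustration only; the question uses 100

# The loop from the question: drop every (L_ID, P_ID) group smaller than min_n
gr = df.groupby(['L_ID', 'P_ID'])
for group in gr.groups:
    n_vals = len(gr.get_group(group))
    if n_vals < min_n:
        df = df[(df['L_ID'] != group[0]) | (df['P_ID'] != group[1])]

print(df)  # only the ('C', 'D') rows remain
```

This is correct but does one boolean scan of the frame per small group, which is why a single vectorized pass is preferable for many groups.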
You can use the filter() method:
# test data
>>> df1 = pd.DataFrame({'a':list('AAABB'), 'b':list('BBBAA'), 'c':range(5)})
>>> df1
a b c
0 A B 0
1 A B 1
2 A B 2
3 B A 3
4 B A 4
>>> df1.groupby(['a','b']).filter(lambda x: len(x) > 2)
a b c
0 A B 0
1 A B 1
2 A B 2
It looks like this method does not work when there are more columns:
>>> df1 = pd.DataFrame({'a':list('AAABB'), 'b':list('BBBAA'), 'c':range(5), 'd':range(5)})
>>> df1
a b c d
0 A B 0 0
1 A B 1 1
2 A B 2 2
3 B A 3 3
4 B A 4 4
>>> df1.groupby(['a','b']).filter(lambda x: len(x) > 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 2094, in filter
if res:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Here's a workaround: apply len() to a single column inside the lambda:
>>> df1.groupby(['a','b']).filter(lambda x: len(x['c']) > 2)
a b c d
0 A B 0 0
1 A B 1 1
2 A B 2 2
You can also use transform():
>>> df1[df1.groupby(['a','b'])['c'].transform(lambda x: len(x) > 2).astype(bool)]
a b c d
0 A B 0 0
1 A B 1 1
2 A B 2 2
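As a side note, instead of a Python-level lambda you can pass the string 'size' to transform(), which broadcasts each group's size back to its rows; this sketch assumes a reasonably recent pandas and is typically faster on large frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': list('AAABB'), 'b': list('BBBAA'),
                    'c': range(5), 'd': range(5)})

# Each row gets the size of its ('a', 'b') group, then we mask on it
sizes = df1.groupby(['a', 'b'])['c'].transform('size')
result = df1[sizes > 2]
print(result)  # the three ('A', 'B') rows
```

Because 'size' is a built-in aggregation name, pandas computes it with a fast vectorized path rather than calling back into Python once per group.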