I have a pandas dataframe with several columns that labels data in a final column, for example,
df = pd.DataFrame( {'1_label' : ['a1','b1','c1','d1'],
'2_label' : ['a2','b2','c2','d2'],
'3_label' : ['a3','b3','c3','d3'],
'data' : [1,2,3,4]})
df = 1_label 2_label 3_label data
0 a1 a2 a3 1
1 b1 b2 b3 2
2 c1 c2 c3 3
3 d1 d2 d3 4
and a list of tuples,
list_t = [('a1','a2','a3'), ('d1','d2','d3')]
I want to filter this dataframe and return a new dataframe containing only the rows that correspond to the tuples in my list.
result = 1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4
My naive (and C++ inspired) solution was to use append (like vector::push_back)
for l1, l2, l3 in list_t:
if df[(df['1_label'] == l1) &
(df['2_label'] == l2) &
(df['3_label'] == l3)].empty is False:
result = result.append(df[(df['1_label'] == l1) &
(df['2_label'] == l2) &
(df['3_label'] == l3)]
While my solution works I suspect it is horrendously slow for large dataframes and large list of tuples as I think pandas creates a new dataframe upon each call to append. Could anyone suggest a faster/cleaner way to do this? Thanks!
Assuming no duplicates, you could create index out of the columns you want to "filter" on:
In [10]: df
Out[10]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 b1 b2 b3 2
2 c1 c2 c3 3
3 d1 d2 d3 4
In [11]: df.set_index(['1_label', '2_label', '3_label'])\
.loc[[('a1','a2','a3'), ('d1','d2','d3')]]\
.reset_index()
Out[11]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments