Is there a way to query a DataFrame for rows that contain a particular string in any column? Is there something like Series.str, but for a DataFrame?
This is what I have so far:
In [1]: import numpy as np; import pandas as pd
In [2]: s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est"
In [3]: df = pd.DataFrame(np.array(s.split(' ')).reshape((-1, 4)), columns=['one', 'two', 'three', 'four'])
In [4]: df
Out[4]:
           one            two         three        four
0        Lorem          ipsum         dolor         sit
1        amet,    consectetur   adipisicing       elit,
2          sed             do       eiusmod      tempor
3   incididunt             ut        labore          et
4       dolore          magna       aliqua.          Ut
5         enim             ad         minim     veniam,
6         quis        nostrud  exercitation     ullamco
7      laboris           nisi            ut     aliquip
8           ex             ea       commodo  consequat.
9         Duis           aute         irure       dolor
10          in  reprehenderit            in   voluptate
11       velit           esse        cillum      dolore
12          eu         fugiat         nulla   pariatur.
13   Excepteur           sint      occaecat   cupidatat
14         non      proident,          sunt          in
15       culpa            qui       officia    deserunt
16      mollit           anim            id         est

[17 rows x 4 columns]
In [5]: mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')
In [6]: df[mask]
Out[6]:
       one    two    three    four
0    Lorem  ipsum    dolor     sit
4   dolore  magna  aliqua.      Ut
9     Duis   aute    irure   dolor
11   velit   esse   cillum  dolore

[4 rows x 4 columns]
Ideally, I'd like to replace the last two lines with something like:
df[df.ix[:, 'one':'four'].str.contains('dolor')]
Is this possible?
pandas has no DataFrame.str method (at least not yet). You could, however, use
import numpy as np
mask = np.logical_or.reduce(
    [df[col].str.contains('dolor')
     for col in df.loc[:, 'one':'four'].columns])
which is a little less to write, and also faster, than
mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')
In [29]: %timeit mask = np.logical_or.reduce([df[col].str.contains('dolor') for col in df.loc[:, 'one':'four'].columns]); df[mask]
1000 loops, best of 3: 761 µs per loop
In [30]: %timeit mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor'); df[mask]
1000 loops, best of 3: 1.13 ms per loop
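For anyone who wants to try this outside an IPython session, here is a minimal self-contained sketch of the same idea. The helper name `rows_containing` is hypothetical (not part of pandas), the sample text is shortened to five rows, and `regex=False` and `na=False` are my own additions so that the pattern is matched literally and missing values count as non-matches; the answer above uses the defaults. The second mask, built with `DataFrame.apply` and `any(axis=1)`, is an alternative spelling rather than the method from the answer, and I have not timed it.

```python
import numpy as np
import pandas as pd

# Shortened sample data (20 words -> 5 rows of 4 columns).
words = ("Lorem ipsum dolor sit amet, consectetur adipisicing elit, "
         "sed do eiusmod tempor incididunt ut labore et "
         "dolore magna aliqua. Ut").split()
df = pd.DataFrame(np.array(words).reshape((-1, 4)),
                  columns=['one', 'two', 'three', 'four'])


def rows_containing(frame, text, columns=None):
    """Hypothetical helper: rows where any of the given columns contains `text`."""
    cols = frame.columns if columns is None else columns
    # One boolean Series per column, OR-ed together element-wise.
    mask = np.logical_or.reduce(
        [frame[col].str.contains(text, regex=False, na=False) for col in cols])
    return frame[mask]


result = rows_containing(df, 'dolor')

# Alternative spelling: build a boolean frame, then ask whether any column matched.
mask_any = df.apply(
    lambda col: col.str.contains('dolor', regex=False, na=False)).any(axis=1)
alt = df[mask_any]

print(result)  # rows 0 and 4 ('dolor' and 'dolore')
```

Both masks select the same rows here. The `apply` version assumes every column holds strings; on a frame with mixed dtypes you would need to pick the string columns first.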