Search a pandas DataFrame column for a specific set of strings and return the matched string


梅丽·唐纳德

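I want to search a specific column for a set of values and, when one matches, return the string that matched. At the moment I can only get True/False. The steps are as follows:

Create the df: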

import numpy as np
import pandas as pd

Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
    'Price': [22000,25000,27000,35000, 29000],
    'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

df = pd.DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])

Search for a specific set of values:

search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
pattern = '|'.join(search_for_these_values)
df['Match'] = df["Brand"].str.contains(pattern, na=False)

Print the df:

print(df)
            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

I would like the "Match" column to contain the following instead:

            Brand  Price Liscence Plate       Match
0     Honda Civic  22000        ABC 123       Honda
1  Toyota Corolla  25000        XYZ 789         Toy
2      Ford Focus  27000        CBA 321  Ford Focus
3         Audi A4  35000        ZYX 987
4             NaN  29000        DEF 456

维克多·史翠比维

You can use

pattern = r'({})'.format('|'.join(sorted(search_for_these_values, key=len, reverse=True)))
df['Match'] = df["Brand"].str.extract(pattern, expand=False)

Output:

>>> df
            Brand  Price Liscence Plate       Match
0     Honda Civic  22000        ABC 123       Honda
1  Toyota Corolla  25000        XYZ 789         Toy
2      Ford Focus  27000        CBA 321  Ford Focus
3         Audi A4  35000        ZYX 987         NaN
4             NaN  29000        DEF 456         NaN
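Note that rows without a match come back as NaN, whereas the desired output in the question shows blanks. If empty strings are preferred, one possible approach is to chain fillna('') after the extraction. This is only a minimal sketch: it rebuilds just the Brand column from the question, and the fillna step is an addition that is not part of the answer's code.

```python
import numpy as np
import pandas as pd

# Only the Brand column from the question is needed for this illustration
df = pd.DataFrame({'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', np.nan]})
search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']

# Same pattern as in the answer: longest keywords first, wrapped in one capturing group
pattern = r'({})'.format('|'.join(sorted(search_for_these_values, key=len, reverse=True)))

# extract leaves NaN where nothing matched; fillna('') replaces those with empty strings
df['Match'] = df['Brand'].str.extract(pattern, expand=False).fillna('')
print(df)
```

Whether NaN or an empty string is the better marker depends on later processing: NaN works with isna() and dropna(), while empty strings display as blanks.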

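Details:

sorted(search_for_these_values, key=len, reverse=True) - Because your keywords include multi-word entries, longer entries must come before shorter ones in the alternation: in an NFA regex, the first alternative that matches at the current position "wins", and the regex engine does not try the remaining alternatives at that position.
'|'.join(...) - The alternation pattern is built from the sorted keywords.
r'({})'.format(...) - The alternation is wrapped in a capturing group, which Series.str.extract needs in order to work: it returns the text captured by the group and raises a ValueError if the pattern contains no capturing group.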
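The effect of the ordering can be checked with Python's re module alone. The sketch below uses a hypothetical pair of overlapping keywords ('Ford' and 'Ford Focus'), which are not the ones from the question. It also passes each keyword through re.escape, an optional extra step that the code above does not include but that may be worth considering if keywords can contain regex metacharacters such as '.' or '+'.

```python
import re

# Hypothetical keywords where the shorter one is a prefix of the longer one
keywords = ['Ford', 'Ford Focus']

# Alternation in the given order: 'Ford' is tried first
unsorted_pattern = '({})'.format('|'.join(map(re.escape, keywords)))

# Alternation with the longest keyword first, as in the answer
by_length = sorted(keywords, key=len, reverse=True)
sorted_pattern = '({})'.format('|'.join(map(re.escape, by_length)))

text = 'Ford Focus'
short_first = re.search(unsorted_pattern, text).group(1)
long_first = re.search(sorted_pattern, text).group(1)

print(repr(short_first))  # the shorter alternative wins when it is listed first
print(repr(long_first))   # the longer alternative wins after sorting by length
```

With the unsorted pattern the match stops at 'Ford', because that alternative succeeds first. Sorting by length lets 'Ford Focus' be tried before its own prefix.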