How do I split a string on two capital letters with a regular expression in a pandas DataFrame?


Bluetail

In the teams column of my DataFrame, I am trying to split the team name "CubsWhite Sox" into two parts, "Cubs" and "White Sox".

import pandas as pd
import re
data = [{'teams':'CubsWhite Sox','area':'Chicago','league': 'MLB'}, {'teams': 'Red Sox','area':'Boston', 'league': 'MLB'}, {'teams': 'Blue Jay','area':'Toronto', 'league': 'MLB'}] 

df = pd.DataFrame(data) 
df

So far, this is the closest result I have been able to get.

df["team"] = df.apply(lambda x: re.findall(r"[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)", x["teams"]), axis=1)
df
    teams           area    league   team
0   CubsWhite Sox   Chicago MLB      [White Sox]
1   Red Sox         Boston  MLB      [Red Sox]
2   Blue Jay        Toronto MLB      [Blue Jay]

Another pattern I found also leaves trailing whitespace after White, Red, and Blue.

df["team"] = df.apply(lambda x: re.findall(r"[A-Z0-9][^A-Z]*", x["teams"]), axis=1)
df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White , Sox]
1   Red Sox         Boston  MLB     [Red , Sox]
2   Blue Jay        Toronto MLB     [Blue , Jay]

I can easily remove that whitespace:

df['teams'] = df['teams'].str.replace(r' +', '', regex=True)

Could you help me split these team names as shown below, using re.findall?

df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White Sox]
1   Red Sox         Boston  MLB     [Red Sox]
2   Blue Jay        Toronto MLB     [Blue Jay]

Thanks!

Wiktor Stribiżew

You can use

df['team'] = df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')

Pattern details:

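[A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters
(?:\s+[A-Z][a-z]*)? - an optional non-capturing group that matches:
- \s+ - one or more whitespace characters
- [A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters.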
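
The pattern can also be checked outside pandas with plain re.findall, which is what the question asked for. This is only a minimal sketch: the names TEAM_PATTERN and split_teams are illustrative and do not come from the original answer.

```python
import re

# Illustrative name: the pattern from the answer, compiled once for reuse.
TEAM_PATTERN = re.compile(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')


def split_teams(text):
    """Return every match of the pattern in text (hypothetical helper)."""
    # The group is non-capturing, so findall returns the whole matches.
    return TEAM_PATTERN.findall(text)


for raw in ['CubsWhite Sox', 'Red Sox', 'Blue Jay']:
    print(raw, '->', split_teams(raw))
```

Because the optional group allows only one extra word, a three-word name such as 'New York Yankees' would come back as two items, 'New York' and 'Yankees'. The pattern would need adjusting if the data contains such names.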

Pandas test:

>>> df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
0    [Cubs, White Sox]
1            [Red Sox]
2           [Blue Jay]
Name: teams, dtype: object
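
For completeness, here is a sketch that runs the same pattern end to end on the sample data from the question. It compares Series.str.findall with re.findall applied through Series.apply. The column name team_re is an assumption made for illustration only.

```python
import re

import pandas as pd

# Sample data copied from the question.
data = [
    {'teams': 'CubsWhite Sox', 'area': 'Chicago', 'league': 'MLB'},
    {'teams': 'Red Sox', 'area': 'Boston', 'league': 'MLB'},
    {'teams': 'Blue Jay', 'area': 'Toronto', 'league': 'MLB'},
]
df = pd.DataFrame(data)

pattern = r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?'

# Form used in the answer: pandas applies the pattern to each string.
df['team'] = df['teams'].str.findall(pattern)

# Same result via re.findall on each value (team_re is a hypothetical name).
df['team_re'] = df['teams'].apply(lambda s: re.findall(pattern, s))

print(df[['teams', 'team', 'team_re']])
```

Both columns should hold the same lists. The str.findall form is shorter and leaves missing values as missing, whereas the apply form would raise an error on a non-string entry such as NaN.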
