In the teams column of my DataFrame, I am trying to split the team name "CubsWhite Sox" into two parts, "Cubs" and "White Sox".
import pandas as pd
import re
data = [{'teams':'CubsWhite Sox','area':'Chicago','league': 'MLB'}, {'teams': 'Red Sox','area':'Boston', 'league': 'MLB'}, {'teams': 'Blue Jay','area':'Toronto', 'league': 'MLB'}]
df = pd.DataFrame(data)
df
So far, this is the only result I have been able to get.
df["team"] = df.apply(lambda x: re.findall(r"[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)", x["teams"]), axis=1)
df
           teams     area  league         team
0  CubsWhite Sox  Chicago     MLB  [White Sox]
1        Red Sox   Boston     MLB    [Red Sox]
2       Blue Jay  Toronto     MLB   [Blue Jay]
I also tried the following pattern, which I found here, but it leaves trailing whitespace after White, Red, and Blue.
df["team"] = df.apply(lambda x: re.findall(r"[A-Z0-9][^A-Z]*", x["teams"]), axis=1)
df
           teams     area  league                 team
0  CubsWhite Sox  Chicago     MLB  [Cubs, White , Sox]
1        Red Sox   Boston     MLB          [Red , Sox]
2       Blue Jay  Toronto     MLB         [Blue , Jay]
I can easily remove the spaces with
df['teams'] = df['teams'].str.replace(r' +', '')
Could you please help me split these team names as shown below, using re.findall?
df
           teams     area  league               team
0  CubsWhite Sox  Chicago     MLB  [Cubs, White Sox]
1        Red Sox   Boston     MLB          [Red Sox]
2       Blue Jay  Toronto     MLB         [Blue Jay]
Thanks!
You can use
df['team'] = df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
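See the regex demo. Details:
[A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters
(?:\s+[A-Z][a-z]*)? - an optional non-capturing group that matches:
    \s+ - one or more whitespace characters
    [A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters.
Pandas test: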
>>> df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
0    [Cubs, White Sox]
1            [Red Sox]
2           [Blue Jay]
Name: teams, dtype: object
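Since the question asks for re.findall specifically, here is a minimal sketch of the same pattern used with the standard re module instead of Series.str.findall. The names TEAM_PATTERN and split_teams are made up for this illustration and do not come from the question or the answer.

```python
import re

# Same pattern as in the answer: one capitalized word, optionally followed
# by whitespace and a second capitalized word.
# TEAM_PATTERN and split_teams are hypothetical names used for illustration.
TEAM_PATTERN = re.compile(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')


def split_teams(value):
    """Return the list of team names found in one 'teams' cell."""
    return TEAM_PATTERN.findall(value)


teams = ['CubsWhite Sox', 'Red Sox', 'Blue Jay']
result = [split_teams(name) for name in teams]
print(result)
# [['Cubs', 'White Sox'], ['Red Sox'], ['Blue Jay']]
```

On the DataFrame from the question, the same helper could be applied with df['team'] = df['teams'].apply(split_teams). Note that the pattern allows at most two words per name, so a three-word name such as "New York Yankees" would come back as "New York" and "Yankees".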