How to separate a string with two uppercase letters and a space using regex in a pandas DataFrame?
I have a DataFrame column, teams, and I am trying to split the team name 'CubsWhite Sox' into two parts, 'Cubs' and 'White Sox'.
import pandas as pd
import re
data = [{'teams':'CubsWhite Sox','area':'Chicago','league': 'MLB'}, {'teams': 'Red Sox','area':'Boston', 'league': 'MLB'}, {'teams': 'Blue Jay','area':'Toronto', 'league': 'MLB'}]
df = pd.DataFrame(data)
df
So far I can only get this result:
df["team"] = df.apply(lambda x: re.findall(r"[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)", x["teams"]), axis=1)
df
teams area league team
0 CubsWhite Sox Chicago MLB [White Sox]
1 Red Sox Boston MLB [Red Sox]
2 Blue Jay Toronto MLB [Blue Jay]
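From here I found the following pattern, but it splits at every capital letter and leaves trailing whitespace after White, Red, and Blue.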
df["team"] = df.apply(lambda x: re.findall(r"[A-Z0-9][^A-Z]*", x["teams"]), axis=1)
df
teams area league team
0 CubsWhite Sox Chicago MLB [Cubs, White , Sox]
1 Red Sox Boston MLB [Red , Sox]
2 Blue Jay Toronto MLB [Blue , Jay]
I can easily remove the whitespace with:
df['teams'] = df['teams'].str.replace(r' +', '')
Could you help me split the team names like this, preferably using re.findall?
df
teams area league team
0 CubsWhite Sox Chicago MLB [Cubs, White Sox]
1 Red Sox Boston MLB [Red Sox]
2 Blue Jay Toronto MLB [Blue Jay]
Thanks!
You can try:
# Raw string so that \w and \s reach the regex engine unescaped
df['new_teams'] = (df.teams.str.extract(r'([A-Z][a-z]+)?([A-Z]\w+\s+\w+)')
                     .apply(lambda x: list(x.dropna()), axis=1)
                  )
Output:
teams area league new_teams
0 CubsWhite Sox Chicago MLB [Cubs, White Sox]
1 Red Sox Boston MLB [Red Sox]
2 Blue Jay Toronto MLB [Blue Jay]
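To see what the two capture groups do, the same pattern can be run with the plain re module. This is only a sketch: extract_teams is a hypothetical helper name, and it assumes str.extract behaves like re.search in that it takes the first match in each string.

```python
import re

# Same pattern as in the str.extract call above.
PATTERN = re.compile(r'([A-Z][a-z]+)?([A-Z]\w+\s+\w+)')

def extract_teams(name):
    """Hypothetical helper: return the non-empty capture groups of the first match."""
    m = PATTERN.search(name)
    if m is None:
        return []
    # Group 1 is optional, so it is None when there is no fused prefix.
    return [g for g in m.groups() if g is not None]

for name in ['CubsWhite Sox', 'Red Sox', 'Blue Jay', 'Cubs']:
    print(name, '->', extract_teams(name))
```

Note that the second group requires whitespace, so a single-word name such as 'Cubs' does not match at all and yields an empty list.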
You can use
df['team'] = df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
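See the regex demo. Details:
[A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters
(?:\s+[A-Z][a-z]*)? - an optional non-capturing group matching:
    \s+ - one or more whitespace characters
    [A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters.
Pandas test: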
>>> df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
0 [Cubs, White Sox]
1 [Red Sox]
2 [Blue Jay]
Name: teams, dtype: object
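Since the question asks for re.findall specifically, the same pattern should also work outside the .str accessor. A minimal sketch, where split_teams is a hypothetical helper name and not part of pandas or re:

```python
import re

# Same pattern as above, compiled once for reuse.
TEAM_RE = re.compile(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')

def split_teams(name):
    """Hypothetical helper: split fused team names with re.findall semantics."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return TEAM_RE.findall(name)

for name in ['CubsWhite Sox', 'Red Sox', 'Blue Jay']:
    print(name, '->', split_teams(name))
```

On the DataFrame from the question this could be applied with df['teams'].map(split_teams), which should give the same lists as the str.findall call.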