How to separate a string with 2 uppercase letters and a space with regex in a pandas DataFrame?

Question

I have a DataFrame column, teams, and I am trying to split the team name 'CubsWhite Sox' into two parts, 'Cubs' and 'White Sox'.

import pandas as pd
import re
data = [{'teams':'CubsWhite Sox','area':'Chicago','league': 'MLB'}, {'teams': 'Red Sox','area':'Boston', 'league': 'MLB'}, {'teams': 'Blue Jay','area':'Toronto', 'league': 'MLB'}] 

df = pd.DataFrame(data) 
df

So far I can only get this result:

df["team"] = df.apply(lambda x: re.findall(r"[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)", x["teams"]), axis=1)
df
    teams           area    league   team
0   CubsWhite Sox   Chicago MLB      [White Sox]
1   Red Sox         Boston  MLB      [Red Sox]
2   Blue Jay        Toronto MLB      [Blue Jay]

I also found the following pattern, but it leaves trailing whitespace after White, Red and Blue:

df["team"] = df.apply(lambda x: re.findall(r"[A-Z0-9][^A-Z]*", x["teams"]), axis=1)
df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White , Sox]
1   Red Sox         Boston  MLB     [Red , Sox]
2   Blue Jay        Toronto MLB     [Blue , Jay]

I can easily remove those spaces with:

df['teams'] = df['teams'].str.replace(r' +', '', regex=True)

Could you help me split the team names like this, preferably using re.findall?

df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White Sox]
1   Red Sox         Boston  MLB     [Red Sox]
2   Blue Jay        Toronto MLB     [Blue Jay]

Thanks!

Answer 1

You can try:

df['new_teams'] = (df.teams.str.extract(r'([A-Z][a-z]+)?([A-Z]\w+\s+\w+)')
                     .apply(lambda x: list(x.dropna()), axis=1)
                  )

Output:

           teams     area league          new_teams
0  CubsWhite Sox  Chicago    MLB  [Cubs, White Sox]
1        Red Sox   Boston    MLB          [Red Sox]
2       Blue Jay  Toronto    MLB         [Blue Jay]
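
To make the two steps of this answer easier to follow, here is a minimal self-contained sketch, assuming a DataFrame that holds only the teams column from the question. The intermediate names `parts` and `new_teams` are illustrative and not part of the original answer. `str.extract` returns one column per capture group, with NaN where the optional first group did not take part in the match, and `dropna` then discards those gaps row by row.

```python
import pandas as pd

# Reduced version of the question's data (assumption: only 'teams' is needed).
df = pd.DataFrame({'teams': ['CubsWhite Sox', 'Red Sox', 'Blue Jay']})

# Group 1: optional leading one-word name; group 2: a two-word name.
parts = df['teams'].str.extract(r'([A-Z][a-z]+)?([A-Z]\w+\s+\w+)')

# Collapse each row into a list, skipping groups that did not match.
df['new_teams'] = parts.apply(lambda row: list(row.dropna()), axis=1)
new_teams = df['new_teams'].tolist()
print(new_teams)
```

Because `str.extract` only keeps the first match per row, this approach seems best suited to names with at most one single-word team in front of a two-word team.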

Answer 2

You can use

df['team'] = df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')

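See the regex demo. Details:

[A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters
(?:\s+[A-Z][a-z]*)? - an optional non-capturing group matching
- \s+ - one or more whitespace characters
- [A-Z][a-z]* - an uppercase letter followed by zero or more lowercase letters.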
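
To illustrate why the optional group matters, the sketch below compares this pattern with the first attempt from the question, using only the standard `re` module. The variable names and the sample list are assumptions made for illustration.

```python
import re

names = ['CubsWhite Sox', 'Red Sox', 'Blue Jay']

# First attempt from the question: the second word is mandatory,
# so a lone word such as 'Cubs' can never match.
mandatory = re.compile(r'[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)')

# Pattern from this answer: the trailing '?' makes the second word optional.
optional = re.compile(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')

mandatory_hits = [mandatory.findall(name) for name in names]
optional_hits = [optional.findall(name) for name in names]

print(mandatory_hits)  # [['White Sox'], ['Red Sox'], ['Blue Jay']]
print(optional_hits)   # [['Cubs', 'White Sox'], ['Red Sox'], ['Blue Jay']]
```

The group is optional only once, so a three-word name such as 'New York Yankees' would come back as two items, ['New York', 'Yankees'].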

Pandas test:

>>> df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
0    [Cubs, White Sox]
1            [Red Sox]
2           [Blue Jay]
Name: teams, dtype: object
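
Since the question asks for re.findall specifically, the same pattern can presumably also be applied through a precompiled pattern and `Series.apply`. This is a minimal sketch assuming the DataFrame from the question; `TEAM_RE` is a hypothetical name chosen for this example.

```python
import re

import pandas as pd

data = [
    {'teams': 'CubsWhite Sox', 'area': 'Chicago', 'league': 'MLB'},
    {'teams': 'Red Sox', 'area': 'Boston', 'league': 'MLB'},
    {'teams': 'Blue Jay', 'area': 'Toronto', 'league': 'MLB'},
]
df = pd.DataFrame(data)

# Same pattern as above, compiled once and reused for every row.
TEAM_RE = re.compile(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')

# Each cell of the new column holds the list returned by findall.
df['team'] = df['teams'].apply(TEAM_RE.findall)
print(df)
```

The vectorized `str.findall` call is usually the more idiomatic choice in pandas; the `apply` variant mainly helps when the same compiled pattern is shared with non-pandas code.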
