Create a DataFrame by extracting words that contain periods after a specific word

Question

I have the following text:

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."

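I need to extract all sport names (after sport:) and styles (after style:) and create new columns sports and style. I am trying the following code to extract the main sentence (the text is sometimes very large):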

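import re

# Split into sentences: after a period, before a capitalized word
m = re.split(r'(?<=\.)\s+(?=[A-Z]\w+)', text_main)
# Keep the first sentence that mentions the identified leagues
text = list(filter(lambda x: re.search(r'leagues were identified', x, flags=re.IGNORECASE), m))[0]
print(text)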

The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d.

Then I extract the sport names and put them into a DataFrame:

import pandas as pd

# re.findall returns an empty list when nothing matches, so no guard is needed
sport_list = re.findall(r'sport:\W*(\w+)', text)

df = pd.DataFrame({'sports': sport_list})
print(df)

    sports
0   basketball
1   soccer
2   football

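However, I am having trouble extracting the styles, because every style has a period (.) after its first letter (c) and some also contain the > symbol. In addition, not every sport has style information.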

Desired output:

    sports        style
0   basketball    c.123>d
1   soccer        NA
2   football      c.124>d

What would be the smartest way to do this? Any suggestions would be appreciated. Thanks!

Answer 1

You can use

\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?

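See the regex demo. Details:

\b - a word boundary
sport: - a fixed string
\s* - zero or more whitespace characters
(\w+) - Group 1: one or more word characters
(?: - start of an optional non-capturing group:
- (?:(?!\bsport:).)*? - any character other than a line break, zero or more times but as few as possible, that does not start a whole-word sport: character sequence
- \bstyle: - the whole word style followed by :
- \s* - zero or more whitespace characters
- (\S+) - Group 2: one or more non-whitespace characters
)? - end of the optional non-capturing group.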
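
The breakdown above can also be written into the pattern itself with re.VERBOSE. The following is only a sketch of the same regex; the names PATTERN, text and pairs are made up for illustration. It uses re.finditer so that a sport without a style yields None for Group 2 instead of an empty string.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
PATTERN = re.compile(r"""
    \bsport:\s*(\w+)          # Group 1: the sport name
    (?:                       # optional tail that holds the style
        (?:(?!\bsport:).)*?   # advance lazily, never crossing the next sport:
        \bstyle:\s*(\S+)      # Group 2: the style token
    )?
""", re.VERBOSE)

text = ("sport: basketball league: N.B.A. style: c.123>d "
        "sport: soccer league: E.P.L. "
        "sport: football league: N.F.L. style: c.124>d.")

# group(2) is None when the optional part did not take part in the match.
pairs = [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]
print(pairs)
```

The lookahead (?!\bsport:) is what stops soccer from borrowing the style that belongs to football: the lazy scan cannot step past the next sport:, so the optional group is skipped.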

See the Python demo:

import re
import pandas as pd
text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
matches = re.findall(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main)
df = pd.DataFrame(matches, columns=['sports', 'style'])

Output:

>>> df
       sports    style
0  basketball   c.123>d
1      soccer          
2    football  c.124>d.
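
Note that in the output above the last style keeps the sentence-final period (c.124>d.) and the missing style is an empty string rather than NA. One possible cleanup, sketched here under the assumption that a real style value never ends with a period, is to strip trailing periods and mask empty strings as missing. The variable names pattern and cleaned are illustrative only.

```python
import re
import pandas as pd

text_main = ("The following leagues were identified: sport: basketball league: N.B.A. "
             "style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. "
             "style: c.124>d. The other leagues aren't that important.")

pattern = r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?'
df = pd.DataFrame(re.findall(pattern, text_main), columns=['sports', 'style'])

# Assumption: a trailing "." is sentence punctuation, not part of the style.
cleaned = df['style'].str.rstrip('.')
# Empty strings (sports without a style) become missing values.
df['style'] = cleaned.mask(cleaned == '')
print(df)
```

If a style could legitimately end with a period, this stripping step would be too aggressive and the pattern itself would need to be tightened instead.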
