Create a DataFrame by Extracting Words with Periods after a Specific Word
I have the following text:
text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
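I need to extract all the sport names (the word after sport:) and the styles (the token after style:) and put them in two new columns named sports and style. I am using the following code to extract the main sentence first (the text is sometimes large):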
import re
# Split on whitespace that follows a period and precedes a capitalized word
m = re.split(r'(?<=\.)\s+(?=[A-Z]\w+)', text_main)
text = list(filter(lambda x: re.search(r'leagues were identified', x, flags=re.IGNORECASE), m))[0]
print(text)
The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d.
Then I extract the sport names and put them into a DataFrame:
import pandas as pd

if 'sport:' in text:
    sport_list = re.findall(r'sport:\W*(\w+)', text)
    df = pd.DataFrame({'sports': sport_list})
    print(df)
       sports
0  basketball
1      soccer
2    football
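However, I am having trouble extracting the styles, because every style has a period after its first letter (c) and a few contain the symbol >. In addition, not every sport has style information.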
Desired output:
       sports    style
0  basketball  c.123>d
1      soccer       NA
2    football  c.124>d
What would be the smartest way to do it? Any suggestions would be appreciated. Thanks!
You can use
\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?
See the regex demo. Details:
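\b - a word boundary
sport: - a literal string
\s* - zero or more whitespace characters
(\w+) - Group 1: one or more word characters
(?: - start of an optional non-capturing group:
    (?:(?!\bsport:).)*? - any character other than a line break, zero or more times but as few as possible, that does not start the whole-word sport: character sequence
    \bstyle: - the whole word style followed by :
    \s* - zero or more whitespace characters
    (\S+) - Group 2: one or more non-whitespace characters
)? - end of the optional non-capturing group.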
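To see why the (?:(?!\bsport:).)*? part matters, here is a small sketch comparing it with a plain .*? in the same position. The variable names naive and tempered are only illustrative; the text is the one from the question.

```python
import re

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."

# Plain lazy dot: it is free to run past the next "sport:" to reach a "style:"
naive = re.findall(r'\bsport:\s*(\w+)(?:.*?\bstyle:\s*(\S+))?', text_main)

# Lazy dot guarded by a lookahead: it stops before the next "sport:"
tempered = re.findall(
    r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main
)

print(naive)
# [('basketball', 'c.123>d'), ('soccer', 'c.124>d.')]
print(tempered)
# [('basketball', 'c.123>d'), ('soccer', ''), ('football', 'c.124>d.')]
```

With the plain .*?, soccer swallows the football entry and takes its style; the guarded version keeps each style attached to its own sport and leaves soccer's style empty.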
See the Python demo:
import re
import pandas as pd

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
# findall returns one (sport, style) tuple per match; style is '' when absent
matches = re.findall(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main)
df = pd.DataFrame(matches, columns=['sports', 'style'])
Output:
>>> df
       sports     style
0  basketball   c.123>d
1      soccer          
2    football  c.124>d.
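The output above differs from the desired one in two small ways: the last style keeps the sentence-ending period, and the missing style is an empty string rather than a missing value. A minimal post-processing sketch follows, assuming a style never legitimately ends with a period (an assumption not stated in the question). The helper name extract_rows is hypothetical.

```python
import re

PATTERN = re.compile(
    r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?'
)

def extract_rows(text):
    """Return (sport, style) tuples; style is None when it is absent."""
    rows = []
    for sport, style in PATTERN.findall(text):
        # Assumption: a trailing period is sentence punctuation, not part of the style
        style = style.rstrip('.')
        # An empty string means the optional group did not match
        rows.append((sport, style or None))
    return rows

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."

rows = extract_rows(text_main)
print(rows)
# [('basketball', 'c.123>d'), ('soccer', None), ('football', 'c.124>d')]
```

The rows can then be passed to pd.DataFrame(rows, columns=['sports', 'style']) as in the demo; pandas treats the None entries as missing values.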