Extract the words/sentence that occur before a keyword in a string - Python

I have a string like this:

my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

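Now I want to extract the current champion and the underdog, using the keywords champion and underdog.

What makes this tricky is that each contestant's name appears before its keyword, and the keyword itself sits inside parentheses. I want to use a regular expression to extract the information.

Here is what I tried: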

champion = re.findall(r'("champion"[^.]*.)', my_str)
print(champion)

>> ['"champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").']


underdog = re.findall(r'("underdog"[^.]*.)', my_str)
print(underdog)

>>['"underdog").']

However, the result I need for champion is:

brooklyn centenniel, resident of detroit, michigan

and for underdog:

kamil kubaru, the challenger from alexandria, virginia

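How can I do this with a regular expression? (I have been searching for a way to step back a couple of words from the keyword to get the result I want, but no luck so far.) Any help or suggestions would be much appreciated.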

You can use named capture groups to capture the desired results:

between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)
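  • between\s+(?P<champion>.*?)\s+\("champion"\) matches the chunk from between up to ("champion") and captures the wanted portion as the named capture group champion. The lazy .*? keeps the match as short as possible, so it stops at the first ("champion").

  • After that, \s+and\s+(?P<underdog>.*?)\s+\("underdog"\) matches the chunk up to ("underdog") and again captures the wanted portion, this time as the named capture group underdog.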

Example:

In [26]: my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

In [27]: out = re.search(r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)', my_str)

In [28]: out.groupdict()
Out[28]: 
{'champion': 'brooklyn centenniel, resident of detroit, michigan',
 'underdog': 'kamil kubaru, the challenger from alexandria, virginia'}
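
One thing to watch: if the text does not follow this exact wording, re.search returns None and out.groupdict() raises an AttributeError. A minimal sketch that wraps the same pattern in a helper and guards against that case (the name extract_contestants is my own, purely for illustration):

```python
import re

# Same pattern as above, compiled once and split over two lines for readability.
PATTERN = re.compile(
    r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+'
    r'(?P<underdog>.*?)\s+\("underdog"\)'
)

def extract_contestants(text):
    """Hypothetical helper: return the two named groups as a dict, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

my_str = '·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

result = extract_contestants(my_str)
print(result)
```

Returning None rather than raising is just one choice; raising a ValueError with a clear message would work equally well if a missing match should be treated as an error.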

There will be better answers than this one, and I don't know regex at all, but I was bored, so here are my 2 cents.

Here is how I would go about it:

words = my_str.split()
index = words.index('("champion")')
champion = words[index - 6:index]
champion = " ".join(champion)

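For the underdog, you would have to change 6 to 7 and '("champion")' to '("underdog").' (note the trailing period, which str.split() leaves attached to the last word).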
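
Spelled out, the underdog version looks roughly like this. It is only a sketch for this particular string, since the word count of 7 is hard-coded:

```python
my_str = '·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

words = my_str.split()
# The last token keeps the sentence-ending period, hence '("underdog").'
index = words.index('("underdog").')
# The underdog's description is 7 words long in this particular string.
underdog = " ".join(words[index - 7:index])
print(underdog)
```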

I'm not sure this solves your problem in general, but it worked for this particular string when I tested it.

If the trailing period on the underdog keyword is a problem, you can also use str.strip() to remove the punctuation.
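
For instance, stripping the period from the whole string before splitting lets both keywords be looked up the same way. The helper name words_before and its parameters are made up for this sketch, and the word counts still have to be supplied by hand:

```python
def words_before(text, keyword, count):
    """Hypothetical helper: return the `count` words that precede ("keyword") in text."""
    # strip('.') removes periods only from the two ends of the string.
    words = text.strip('.').split()
    # list.index raises ValueError if the keyword token is not present.
    index = words.index(f'("{keyword}")')
    return " ".join(words[index - count:index])

my_str = '·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

champion = words_before(my_str, "champion", 6)
underdog = words_before(my_str, "underdog", 7)
print(champion)
print(underdog)
```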