Regex: using re.findall to find substrings with a space on both sides

Question

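I am trying to use re.findall on my text column to find any of the keywords below, but only where the keyword has a space on both sides, since those are the only occurrences that matter. I am using the following script: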

import re

url = '#MnA deals for 2015 across all #oilandgas sectors were lower than WAR WARduring the CFO Great CIO Recession'

regex = re.findall(r'WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder+', url)
print(regex)
# prints ['WAR', 'WAR', 'CFO', 'CIO']

Instead of that output, I only want

['WAR', 'CFO', 'CIO']

because the second match is not a standalone WAR; it is part of WARduring, and I do not want that.

Also, which operator would return every whole word that starts with one of my keywords, for example

['WAR', 'WARduring','CFO', 'CIO']

Any help is appreciated.

Answer 1

You can use a lookahead:

>>> re.findall(r'\b(?:WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder+)(?=\s|$)', url)
['WAR', 'CFO', 'CIO']

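(?=\s|$) asserts that your keyword is followed by whitespace or the end of the string, and the leading \b requires the keyword to start at a word boundary.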
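To illustrate why a lookahead is used here rather than simply matching the whitespace, here is a minimal sketch. The shortened sample text and the variable names consumed and asserted are only for this illustration. A lookahead checks the next character without making it part of the match:

```python
import re

text = "WAR WARduring the CFO"

# Matching the whitespace directly consumes it, so it ends up in the result.
consumed = re.findall(r'\b(?:WAR|CFO)(?:\s|$)', text)

# The lookahead only asserts that whitespace (or the end) follows.
asserted = re.findall(r'\b(?:WAR|CFO)(?=\s|$)', text)

print(consumed)  # ['WAR ', 'CFO']
print(asserted)  # ['WAR', 'CFO']
```

In both cases WARduring is rejected, because the character after WAR is a letter.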

For the second task, use this regex:

>>> re.findall(r'\b((?:WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder+)\w*)', url)
['WAR', 'WARduring', 'CFO', 'CIO']

Here the \w* after your keywords matches zero or more word characters, so the rest of the word is included in the match.
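If the keyword list changes often, the same pattern could be built from a list instead of typed by hand. This is only a sketch: KEYWORDS, PREFIX_PATTERN and find_keyword_words are names made up for this example, and Founder is written without the trailing + because \w* already covers any extra letters.

```python
import re

# Hypothetical keyword list, copied from the question.
KEYWORDS = ['WAR', 'CIO', 'CISO', 'CTO', 'C-Suite', 'CMO', 'CFO', 'Founder']

# re.escape keeps characters such as '-' literal inside the alternation.
PREFIX_PATTERN = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in KEYWORDS) + r')\w*'
)

def find_keyword_words(text):
    """Return every word that starts with one of the keywords."""
    return PREFIX_PATTERN.findall(text)

url = '#MnA deals for 2015 across all #oilandgas sectors were lower than WAR WARduring the CFO Great CIO Recession'
print(find_keyword_words(url))  # ['WAR', 'WARduring', 'CFO', 'CIO']
```

Because the pattern has no capturing group, findall returns the whole matched word each time.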

Answer 2

Using word boundaries (\b) in your regex will solve your problem.

Regex

\b(?:WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder+)\b

Code

import re

url = '#MnA deals for 2015 across all #oilandgas sectors were lower than WAR WARduring the CFO Great CIO Recession'

regex = re.findall(r'\b(WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder+)\b', url)
print(regex)
# prints ['WAR', 'CFO', 'CIO']
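One caveat worth sketching: \b also treats punctuation as a boundary, so a keyword followed by a comma or wrapped in parentheses still matches. If the requirement is literally whitespace on both sides, negative lookarounds on \S are one possible alternative. The sample text and variable names below are invented for this comparison, and Founder is written without the trailing +.

```python
import re

KEYWORDS = r'(?:WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder)'

# Made-up sample with punctuation next to some keywords.
text = 'theWAR WAR, (CFO) CIO WARduring CTO'

# Word boundaries: punctuation counts as a boundary.
by_word_boundary = re.findall(r'\b' + KEYWORDS + r'\b', text)

# Whitespace lookarounds: no non-space character may touch the keyword.
by_whitespace = re.findall(r'(?<!\S)' + KEYWORDS + r'(?!\S)', text)

print(by_word_boundary)  # ['WAR', 'CFO', 'CIO', 'CTO']
print(by_whitespace)     # ['CIO', 'CTO']
```

Note that (?<!\S) and (?!\S) also accept the start and end of the string, so a keyword at the very end (CTO here) is still returned.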

Answer 3

One approach: a trailing lookahead on its own wrongly detects the WAR in theWAR as WAR.

Another approach: use \b to delimit words

regex=re.findall(r'\b(WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder)\b',url)


import re

url = '#MnA deals for 2015 across all #oilandgas theWAR sectors were lower than WAR WARduring the CFO Great CIO'

# Lookahead only: nothing checks the start of the word, so the WAR in theWAR matches too
regex = re.findall(r'(WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder+)(?=\s|$)', url)
print(regex)  # ['WAR', 'WAR', 'CFO', 'CIO']
# Word boundaries on both sides reject both theWAR and WARduring
regex = re.findall(r'\b(WAR|CIO|CISO|CTO|C-Suite|CMO|CFO|Founder)\b', url)
print(regex)  # ['WAR', 'CFO', 'CIO']

