Python regular expression to split a string and get all words is not working

Question

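I am trying to split a string using a regular expression in Python and get all the matched words.

RE: \w+(\.?\w+)*

This needs to capture only [a-zA-Z0-9_]-like things.

But when I try to match and get everything from the string, it doesn't return the proper results.

Code snippet: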

>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !@#$%^&*()-+=[]{}.,;:'"`| \(`.`)/
... 
... I guess that's it."""
>>> pprint(re.findall(r"\w+(.?\w+)*", string))
[' etc', ' well', ' same', ' wait', ' like', ' it']

It returns only some of the words, but it should return all the words, numbers and underscores [as shown in the linked example].

Python version: Python 3.6.2 (default, Jul 17 2017, 16:44:45)

Thanks.

Answer 1

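You need to use a non-capturing group (when the pattern contains a capturing group, re.findall returns the group's text rather than the whole match) and escape the dot (an unescaped . matches any character except a newline):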

>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(?:\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !@#$%^&*()-+=[]{}.,;:'"`| \(`.`)/
... 
... I guess that's it."""
>>> pprint(re.findall(pattern, string, re.A))
['this', 'is', 'some', 'test', 'string', 'and', 'there', 'are', 'some', 'digits', 'as', 'well', 'that', 'need', 'to', 'be', 'captured', 'as', 'well', 'like', '1234567890', 'and', '321', 'etc', 'But', 'it', 'should', 'also', 'select', '_', 'as', 'well', 'I', 'm', 'pretty', 'sure', 'that', 'that', 'RE', 'does', 'exactly', 'the', 'same', 'Oh', 'wait', 'it', 'also', 'need', 'to', 'filter', 'out', 'the', 'symbols', 'like', 'I', 'guess', 'that', 's', 'it']
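
To make the effect of the group type concrete, here is a minimal sketch. The sample text and variable names are made up for illustration and are not from the question. It compares what re.findall returns for the capturing and the non-capturing version of the pattern, and shows re.finditer as one possible alternative if the capturing group has to stay.

```python
import re

sample = "foo.bar baz_1 42"  # hypothetical sample text, not from the question

# One capturing group: findall returns the text last captured by the group,
# or '' when the group did not take part in the match.
captured = re.findall(r"\w+(\.?\w+)*", sample)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"\w+(?:\.?\w+)*", sample)

# Keeping the capturing group but reading the whole match via group(0).
via_finditer = [m.group(0) for m in re.finditer(r"\w+(\.?\w+)*", sample)]

print(captured)      # ['.bar', '', '']
print(whole)         # ['foo.bar', 'baz_1', '42']
print(via_finditer)  # ['foo.bar', 'baz_1', '42']
```

The first result mirrors the symptom in the question: the list holds group contents, not the words themselves.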

Also, to match only ASCII letters, digits and _, you must pass the re.A flag.

See the Python demo.
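
As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without re.A on a made-up sample containing accented letters. The sample string and variable names are assumptions for illustration only. In Python 3, \w in a str pattern matches Unicode word characters by default, so the difference only shows up on non-ASCII input.

```python
import re

pattern = r"\w+(?:\.?\w+)*"

# Hypothetical sample: "café résumé_1", written with escapes to avoid
# any ambiguity about how the accented characters are encoded.
sample = "caf\u00e9 r\u00e9sum\u00e9_1"

# Default for str patterns: \w also matches non-ASCII letters such as é.
unicode_words = re.findall(pattern, sample)

# With re.A (short for re.ASCII), \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(pattern, sample, re.A)

print(unicode_words)  # ['café', 'résumé_1']
print(ascii_words)    # ['caf', 'r', 'sum', '_1']
```

For the all-ASCII string in the question both calls give the same list, so the flag matters only if the input may contain other scripts or accented characters.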

python

regex

python-3.6