findall() behaviour (python 2.7)

Question

Suppose I have the following string:

"<p>Hello</p>NOT<p>World</p>"

I want to extract the two words Hello and World.

I wrote the following script for the job:

#!/usr/bin/env python

import re

string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)

print match

I'm not particularly interested in stripping the <p> and </p> tags, so I never bothered to do that in the script.

The interpreter prints:

['<p>Hello</p>NOT<p>World</p>']

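So it apparently sees the first <p> and the last </p> while ignoring the tags in between. Shouldn't findall() return all three sets of matching strings (the string it prints, plus the two words)?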

And if it shouldn't, how can I change the code to make it do so?

PS: This is for a project, and I have already found an alternative way to do what I needed, so I guess this is for educational reasons.

Answer 1

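You are getting the entire contents in a single match because [\w\W]+ matches as many characters as it can (including all of your <p> and </p> tags). To prevent this, use the non-greedy version by appending a ?.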

match = re.findall(r"(<p>[\w\W]+?</p>)", string)
# ['<p>Hello</p>', '<p>World</p>']

From the documentation:

*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.
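To make the quoted passage concrete, here is a small sketch that runs both patterns from the documentation through re.findall(). The variable names are only illustrative. Note that findall() returns non-overlapping matches found while scanning left to right, so once the greedy pattern has consumed the whole string there is nothing left for a second or third match. The print calls are written with parentheses so the snippet runs on both Python 2.7 and Python 3.

```python
import re

text = "<a> b <c>"

# Greedy: .* grabs as much as possible, so the match runs from the
# first '<' to the last '>'.
greedy = re.findall(r"<.*>", text)

# Non-greedy: .*? stops at the first '>' it can, so each tag is
# matched separately.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<a> b <c>']
print(lazy)    # ['<a>', '<c>']
```

The documentation's "will match only <a>" describes a single match; findall() keeps scanning after that match, which is why <c> shows up as well.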

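If you don't want the <p> and </p> tags included in the result, you will need to use look-ahead and look-behind assertions to keep them out of the match.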

match = re.findall(r"((?<=<p>)\w+?(?=</p>))", string)
# ['Hello', 'World']
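One caveat with the pattern above: \w+? only matches word characters, so a paragraph that contains a space is skipped. A possible alternative, sketched below, relies on the fact that findall() returns the contents of the group when the pattern has exactly one capturing group, so the tags can stay in the pattern without appearing in the result. The sample string with "Hello there" is made up for illustration.

```python
import re

# Hypothetical input: the first paragraph contains a space.
text = "<p>Hello there</p>NOT<p>World</p>"

# Look-around version from above: \w cannot cross the space,
# so only the second paragraph is found.
lookaround = re.findall(r"(?<=<p>)\w+?(?=</p>)", text)

# Capturing-group version: the tags are matched but only the
# group contents are returned.
captured = re.findall(r"<p>(.*?)</p>", text)

print(lookaround)  # ['World']
print(captured)    # ['Hello there', 'World']
```

By default . does not match a newline, so paragraphs spanning several lines would need the re.DOTALL flag (or [\w\W] as in the question).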

As a side note, if you are trying to parse HTML or XML with regular expressions, it is better to use a library such as BeautifulSoup, which is designed for parsing HTML.
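To give a rough idea of what the parser-based approach looks like without requiring an extra install, the sketch below uses the standard library's HTMLParser as a stand-in for BeautifulSoup; it is not BeautifulSoup's API. The class ParagraphCollector and the function paragraph_texts are hypothetical names chosen for this example. It is written for Python 3; in Python 2.7 the module is named HTMLParser rather than html.parser.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside each <p>...</p> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_p = False      # True while we are inside a <p> element
        self.paragraphs = []   # one string per <p> element seen

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Text outside any <p> (such as "NOT") is ignored.
        if self.in_p:
            self.paragraphs[-1] += data


def paragraph_texts(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


print(paragraph_texts("<p>Hello</p>NOT<p>World</p>"))  # ['Hello', 'World']
```

This simple version does not handle nested <p> elements or malformed markup, which is where a dedicated library such as BeautifulSoup earns its keep.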
