Why is non-greedy Python Regex not non-greedy enough?

Question

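I have applied a non-greedy regular expression to a set of URL strings that I am trying to clean up so that they end after .com (.co.uk, etc.). Some of them continue past the desired cutoff with ' or " or <, so I used x = re.findall('([A-Za-z0-9]+@\S+.co\S*?)[\'"<]', finalSoup2).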

The problem is that some URLs look like misc@misc.misc'misc''misc' (or the same with < >), so even with the non-greedy regex I am still left with results such as enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au.

I have tried combinations of two ?s, but evidently to no avail. What is the correct way to get clean URLs in this situation?

Answer 1

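The problem with your regex is that you are currently only looking for Non-spaces(period)co rather than Non-spaces(period)Non-spaces. Note also that \S matches ', " and < as well, so the greedy \S+ can run straight past the delimiter where you wanted the match to stop; the lazy \S*? at the end only limits what is matched after the last .co.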
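To make the overshoot concrete, here is a minimal sketch that runs the pattern from the question on a single string. The `sample` value is an assumption: it is modeled on the output quoted in the question, with a trailing `<` added as a stand-in for whatever markup follows the address.

```python
import re

# Assumed sample: the address appears twice, separated by "> and followed by <.
sample = 'enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au<'

# Pattern from the question, written as a raw string.
original = r'([A-Za-z0-9]+@\S+.co\S*?)[\'"<]'

# The greedy \S+ swallows the first quote and backtracks only as far as the
# LAST ".co", so the lazy \S*? has no chance to stop at the first delimiter.
overshoot = re.findall(original, sample)
print(overshoot)
# ['enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au']
```

Laziness only affects the quantifier it is attached to, which is why adding more `?` marks after `\S*` does not help here.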

So in this case, based on the above, you can use the following regex.

>>> import re
>>> finalSoup2 = """
... misc@misc.misc'misc''misc
... enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au
... google.com
... google.co.uk"'<>Stuff
... """
>>> x = re.findall('([A-Za-z0-9]+@[^\'"<>]+)[\'"<]', finalSoup2)
>>> x
['misc@misc.misc',
 'enquiries@smart-traffic.com.au',
 'enquiries@smart-traffic.com.au\ngoogle.com\ngoogle.co.uk']

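You can then use this to get the URLs you want, but you must make sure to split each result on '\n', because a match may contain a newline from the text, as the last item above shows.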
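The splitting step can be sketched as follows. This reuses the sample text and pattern from the answer; the variable names `matches`, `pieces` and `no_ws` are illustrative, and the second pattern (adding `\s` to the excluded characters) is an alternative suggested here, not part of the original answer.

```python
import re

finalSoup2 = """
misc@misc.misc'misc''misc
enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au
google.com
google.co.uk"'<>Stuff
"""

# Pattern from the answer: stop at the first quote or angle bracket.
matches = re.findall(r'([A-Za-z0-9]+@[^\'"<>]+)[\'"<]', finalSoup2)

# Split every match on newlines and flatten the result into one list.
pieces = [part for m in matches for part in m.split('\n')]
print(pieces)
# ['misc@misc.misc', 'enquiries@smart-traffic.com.au',
#  'enquiries@smart-traffic.com.au', 'google.com', 'google.co.uk']

# Assumed alternative: also exclude whitespace, so a match never spans lines.
no_ws = re.findall(r'([A-Za-z0-9]+@[^\s\'"<>]+)[\'"<]', finalSoup2)
print(no_ws)
# ['misc@misc.misc', 'enquiries@smart-traffic.com.au']
```

The trade-off is that the whitespace-excluding variant drops an address that is followed directly by a newline rather than by a quote or `<`, whereas splitting keeps it along with the unrelated lines that came after it.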

