Why is this regex greedy and why does the example code repeat forever?

Question

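I'm going mad trying to figure this out. It's been 3 days now and I'm ready to give up. The code below should return a list, with no duplicates, of all the phone numbers and email addresses on the clipboard.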

#! python 3
#! Phone number and email address scraper

#take user input for:
#1. webpage to scrape
# - user will be prompted to copy a link
#2. file & location to save to
#3. back to 1 or exit

import pyperclip, re, os.path

#function for locating phone numbers
def phoneNums(clipboard):
    phoneNums = re.compile(r'^(?:\d{8}(?:\d{2}(?:\d{2})?)?|\(\+?\d{2,3}\)\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\d{2}(?:\d{2})?))$')
        #(\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        #(\s)?                          #Optional space
        #(\(\d\))?                      #Optional bracketed area code
        #(\d\d(\s)?\d | \d{3})          #3 digits with optional space between
        #(\s)?                          #Optional space
        #(\d{3})                        #3 digits
        #(\s)?                          #Optional space
        #(\d{4})                        #Last four
        #)
        #)', re.VERBOSE)
    #nos = phoneNums.search(clipboard)  #ignore for now. Failed test of .group()

    return phoneNums.findall(clipboard)

#function for locating email addresses
def emails(clipboard):
    emails = re.compile(r'''(
        [a-z0-9._%+-]*     #username
        @                  #@ sign
        [a-z0-9.-]+        #domain name
        )''', re.I | re.VERBOSE)
    return emails.findall(clipboard)


#function for copying email addresses and numbers from webpage to a file
def scrape(fileName, saveLoc):
    newFile = os.path.join(saveLoc, fileName + ".txt")
    #file = open(newFile, "w+")
    #add phoneNums(currentText) +
    print(currentText)
    print(emails(currentText))
    print(phoneNums(currentText))
    #file.write(emails(currentText))
    #file.close()

url = ''
currentText = ''
file = ''
location =  ''

while True:
    print("Please paste text to scrape. Press ENTER to exit.")
    currentText = str(pyperclip.waitForNewPaste())
    #print("Filename?")
    #file = str(input())
    #print("Where shall I save this? Defaults to C:")
    #location = str(input())
    scrape(file, location)

The emails are returned correctly, but with the hashed-out (commented-out) section, the phone numbers are output as follows:

[('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600'), ('+30 210 458 6601', '+30', ' ', '', '210', '', ' ', '458', ' ', '6601')]

As you can see, the numbers are identified correctly, but my code is too greedy, so I tried adding a "+?":

def phoneNums(clipboard):
    phoneNums = re.compile(r'''(
        (\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        (\s)?                          #Optional space
        (\(\d\))?                      #Optional bracketed area code
        (\d\d(\s)?\d | \d{3})          #3 digits with optional space between
        (\s)?                          #Optional space
        (\d{3})                        #3 digits
        (\s)?                          #Optional space
        (\d{4})                        #Last four
        )+?''', re.VERBOSE)

No joy. I then tried plugging in an example regex from here: Find phone numbers in python script

Now, I know that one works, because other people have tested it. What I get is:

Please paste text to scrape. Press ENTER to exit. 
[] [] 
Please paste text to scrape. Press ENTER to exit. 
[] [('', '', '', '', '', '', '','', '', '')] 
...forever...

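That last one doesn't even give me a chance to copy anything to the clipboard. .waitForNewPaste() should do what it says on the tin, but when I run the code, the program grabs whatever is already on the clipboard and tries to process it (badly).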

There is obviously something wrong in my code, but I can't see it. Any ideas?

Answer 1

As you pointed out, the regex works.

The input part '+30 210 458 6600' is matched once, and the result is a tuple of all the capturing subgroups: ('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600')

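Note that the first element in the tuple is the entire match, because the whole pattern is wrapped in an outer capturing group. Optional groups that did not take part in the match appear as empty strings.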
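To make the return shape concrete, here is a minimal sketch of how re.findall reports matches depending on how many capturing groups the pattern has. The toy patterns and the sample text are made up for illustration and are not taken from the question.

```python
import re

text = "a1 b2"  # made-up sample input

# No groups: findall returns the full matches as plain strings.
no_groups = re.findall(r"[a-z]\d", text)

# Exactly one group: findall returns only that group's text.
one_group = re.findall(r"([a-z])\d", text)

# Several groups: findall returns one tuple per match, one slot per group.
# The optional (x)? group never takes part in a match here, so its slot is ''.
many_groups = re.findall(r"(([a-z])(\d)(x)?)", text)

print(no_groups)    # ['a1', 'b2']
print(one_group)    # ['a', 'b']
print(many_groups)  # [('a1', 'a', '1', ''), ('b2', 'b', '2', '')]
```

The third case has the same shape as the phone number output in the question: an outer group holding the whole match, followed by one slot per inner group.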

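If you make all the groups non-capturing by inserting ?: after each opening parenthesis, there will be no capturing groups, and each result will be just the full match, '+30 210 458 6600', as a str.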

    phoneNums = re.compile(r'''
        (?:\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        (?:\s)?                          #Optional space
        (?:\(\d\))?                      #Optional bracketed area code
        (?:\d\d(?:\s)?\d | \d{3})        #3 digits with optional space between
        (?:\s)?                          #Optional space
        (?:\d{3})                        #3 digits
        (?:\s)?                          #Optional space
        (?:\d{4})                        #Last four
        ''', re.VERBOSE)
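As a rough sketch of how this could slot into the question's function, the snippet below runs the non-capturing pattern on a made-up sample string and drops duplicates with dict.fromkeys, since the question asks for a list without duplicates. PHONE_RE, phone_nums and sample are illustrative names, and the deduplication step is an addition on my part, not something in the original code.

```python
import re

# Same pattern as above, with every group made non-capturing.
PHONE_RE = re.compile(r'''
    (?:\+\d{1,4})?              # optional country code
    (?:\s)?                     # optional space
    (?:\(\d\))?                 # optional bracketed area code
    (?:\d\d(?:\s)?\d | \d{3})   # 3 digits with optional space between
    (?:\s)?                     # optional space
    (?:\d{3})                   # 3 digits
    (?:\s)?                     # optional space
    (?:\d{4})                   # last four
    ''', re.VERBOSE)


def phone_nums(text):
    """Return each distinct phone number once, in order of first appearance."""
    matches = PHONE_RE.findall(text)   # list of str, since there are no groups
    return list(dict.fromkeys(matches))  # dict keys keep insertion order


# Made-up sample text with one repeated number.
sample = "Tel: +30 210 458 6600, Fax: +30 210 458 6601, Tel: +30 210 458 6600"
print(phone_nums(sample))  # ['+30 210 458 6600', '+30 210 458 6601']
```

If you would rather keep the capturing groups, iterating with finditer and reading m.group(0) from each match also gives you the full match text.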

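The code "repeats forever" because the while True: block is an infinite loop: nothing inside it ever ends the loop. If you want to stop after, say, one iteration, you can put a break statement at the end of the block.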

while True:
    currentText = str(pyperclip.waitForNewPaste())
    scrape(file, location)
    break
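If one pass is not enough, another option is to keep looping but give the loop an explicit exit condition. The sketch below assumes a hypothetical convention that pasting the word quit ends the session. scrape_until_stop, its parameters, and the stand-in clipboard are all illustrative names, not part of pyperclip.

```python
def scrape_until_stop(wait_for_paste, handle, stop_word="quit"):
    """Pass each pasted text to handle() until the stop word is pasted.

    wait_for_paste: zero-argument callable that blocks and returns text.
    handle: callable that processes one pasted text.
    stop_word: hypothetical sentinel that ends the loop (an assumption).
    Returns the number of texts that were handled.
    """
    handled = 0
    while True:
        text = str(wait_for_paste())
        if text.strip().lower() == stop_word:
            break  # explicit exit condition, so the loop can end
        handle(text)
        handled += 1
    return handled


# Demo with a stand-in for the clipboard so the sketch runs anywhere.
fake_pastes = iter(["a@example.com", "+30 210 458 6600", "quit", "never reached"])
seen = []
count = scrape_until_stop(lambda: next(fake_pastes), seen.append)
print(count, seen)  # 2 ['a@example.com', '+30 210 458 6600']
```

With the real clipboard you would pass pyperclip.waitForNewPaste as wait_for_paste and your own scraping function as handle.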

python

regex

pyperclip