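Why is this regex greedy and why does the example code repeat forever?
I'm going crazy trying to figure this out. It's been 3 days now and I'm ready to give up. The code below should return a list, without duplicates, of all the phone numbers and email addresses on the clipboard.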
#! python 3
#! Phone number and email address scraper

#take user input for:
#1. webpage to scrape
# - user will be prompted to copy a link
#2. file & location to save to
#3. back to 1 or exit

import pyperclip, re, os.path

#function for locating phone numbers
def phoneNums(clipboard):
    phoneNums = re.compile(r'^(?:\d{8}(?:\d{2}(?:\d{2})?)?|\(\+?\d{2,3}\)\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\d{2}(?:\d{2})?))$')
    #(\+\d{1,4})? #Optional country code (optional: +, 1-4 digits)
    #(\s)? #Optional space
    #(\(\d\))? #Optional bracketed area code
    #(\d\d(\s)?\d | \d{3}) #3 digits with optional space between
    #(\s)? #Optional space
    #(\d{3}) #3 digits
    #(\s)? #Optional space
    #(\d{4}) #Last four
    #)
    #)', re.VERBOSE)
    #nos = phoneNums.search(clipboard) #ignore for now. Failed test of .group()
    return phoneNums.findall(clipboard)

#function for locating email addresses
def emails(clipboard):
    emails = re.compile(r'''(
        [a-z0-9._%+-]* #username
        @ #@ sign
        [a-z0-9.-]+ #domain name
        )''', re.I | re.VERBOSE)
    return emails.findall(clipboard)

#function for copying email addresses and numbers from webpage to a file
def scrape(fileName, saveLoc):
    newFile = os.path.join(saveLoc, fileName + ".txt")
    #file = open(newFile, "w+")
    #add phoneNums(currentText) +
    print(currentText)
    print(emails(currentText))
    print(phoneNums(currentText))
    #file.write(emails(currentText))
    #file.close()

url = ''
currentText = ''
file = ''
location = ''

while True:
    print("Please paste text to scrape. Press ENTER to exit.")
    currentText = str(pyperclip.waitForNewPaste())
    #print("Filename?")
    #file = str(input())
    #print("Where shall I save this? Defaults to C:")
    #location = str(input())
    scrape(file, location)
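The emails are returned correctly, but the phone number output from the commented-out pattern is as follows:
[('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600'), ('+30 210 458 6601', '+30', ' ', '', '210', '', ' ', '458', ' ', '6601')]
As you can see, the numbers are identified correctly, but my code is too greedy, so I tried adding a "+?":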
def phoneNums(clipboard):
    phoneNums = re.compile(r'''(
        (\+\d{1,4})? #Optional country code (optional: +, 1-4 digits)
        (\s)? #Optional space
        (\(\d\))? #Optional bracketed area code
        (\d\d(\s)?\d | \d{3}) #3 digits with optional space between
        (\s)? #Optional space
        (\d{3}) #3 digits
        (\s)? #Optional space
        (\d{4}) #Last four
        )+?''', re.VERBOSE)
No joy. I then tried plugging in a regex example from here: Find phone numbers in python script
I know that one works, because other people have tested it. What I get is:
Please paste text to scrape. Press ENTER to exit.
[] []
Please paste text to scrape. Press ENTER to exit.
[] [('', '', '', '', '', '', '','', '', '')]
...forever...
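The last one won't even let me copy to the clipboard. .waitForNewPaste() should do what it says on the tin, but when I run the code, the program grabs whatever is on the clipboard and tries to process it (badly).
There's obviously something wrong in my code, but I can't see it. Any ideas?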
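As you pointed out, the regex works.
The input portion '+30 210 458 6600' is matched once, and the result is a tuple of all capturing subgroups: ('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600')
Note that the first element of the tuple is the whole match, because the entire pattern is wrapped in an outer capturing group. Optional groups that did not take part in the match show up as empty strings.
If you make all groups non-capturing by inserting ?: after each opening parenthesis, there will be no capturing groups and the result will be just the full match '+30 210 458 6600' as a str.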
phoneNums = re.compile(r'''
    (?:\+\d{1,4})? #Optional country code (optional: +, 1-4 digits)
    (?:\s)? #Optional space
    (?:\(\d\))? #Optional bracketed area code
    (?:\d\d(?:\s)?\d | \d{3}) #3 digits with optional space between
    (?:\s)? #Optional space
    (?:\d{3}) #3 digits
    (?:\s)? #Optional space
    (?:\d{4}) #Last four
    ''', re.VERBOSE)
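To make the difference concrete, here is a minimal sketch that runs both versions of the pattern with re.findall on the same text. The names capturing, non_capturing, sample and unique are made up for this illustration, and the sample text is invented. Since the question asks for a list without duplicates, the last line also shows one possible way to drop repeats while keeping order (dict.fromkeys); that step is an addition, not part of the original code.

```python
import re

# The question's pattern: every group captures, including the outer one.
capturing = re.compile(r'''(
    (\+\d{1,4})?             # optional country code
    (\s)?                    # optional space
    (\(\d\))?                # optional bracketed area code
    (\d\d(\s)?\d | \d{3})    # 3 digits with optional space between
    (\s)?                    # optional space
    (\d{3})                  # 3 digits
    (\s)?                    # optional space
    (\d{4})                  # last four
)''', re.VERBOSE)

# Same pattern with every group made non-capturing via ?:
non_capturing = re.compile(r'''
    (?:\+\d{1,4})?
    (?:\s)?
    (?:\(\d\))?
    (?:\d\d(?:\s)?\d | \d{3})
    (?:\s)?
    (?:\d{3})
    (?:\s)?
    (?:\d{4})
''', re.VERBOSE)

# Invented sample text (hypothetical, for illustration only).
sample = "Tel: +30 210 458 6600, fax: +30 210 458 6601, tel again: +30 210 458 6600"

# With capturing groups, findall returns one tuple per match (one item per group).
print(capturing.findall(sample)[0])

# Without capturing groups, findall returns the full matches as plain strings.
matches = non_capturing.findall(sample)
print(matches)

# Drop duplicates while preserving first-seen order.
unique = list(dict.fromkeys(matches))
print(unique)
```

So the pattern is not being greedy here: the extra tuple items come from how findall reports capturing groups, not from the regex matching too much.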
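The code 'repeats forever' because the while True: block is an infinite loop. If you want to stop after, say, one iteration, you can put a break statement at the end of the block to stop the loop.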
while True:
    currentText = str(pyperclip.waitForNewPaste())
    scrape(file, location)
    break
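If the loop should instead keep running until the user signals they are done (the prompt says "Press ENTER to exit"), the break can be made conditional. The sketch below is only one way to do that under stated assumptions: scrape_loop, read_text and handle are hypothetical names, with read_text standing in for whatever supplies the next text and handle for the processing step. A canned list of strings replaces the clipboard so the example runs without pyperclip.

```python
def scrape_loop(read_text, handle):
    """Call handle(text) for each text from read_text() until an empty one arrives.

    read_text and handle are hypothetical callables used for illustration.
    Returns the number of texts that were processed.
    """
    processed = 0
    while True:
        text = read_text()
        if not text.strip():   # empty or whitespace-only input: leave the loop
            break
        handle(text)
        processed += 1
    return processed


# Canned input standing in for successive pastes; the final '' ends the loop.
pastes = iter(["call +30 210 458 6600", "mail info@example.com", ""])
seen = []
processed = scrape_loop(lambda: next(pastes), seen.append)
print(processed, seen)
```

In the original script, read_text would correspond to the call that fetches new text and handle to something like scrape; how an "exit" is actually signalled there is left open.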