Python and Regex: Problem with re findall()

Question

This is a project from https://automatetheboringstuff.com/2e/chapter7/. It searches the text on the clipboard for phone numbers and email addresses, then copies the results back to the clipboard.

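If I understand correctly, when a regular expression contains groups, findall() returns a list of tuples. Each tuple contains the strings matched by each of the regex's groups.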
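A quick sketch of that behaviour, using made-up sample text and simplified patterns (not the ones from the book). It also shows the one-group case, where findall() returns plain strings rather than tuples:

```python
import re

# Hypothetical sample text, used only for illustration.
text = "call 415-555-1234 or 212-555-9876"

# No groups: a list of whole-match strings.
no_groups = re.findall(r"\d{3}-\d{3}-\d{4}", text)

# Exactly one group: a list of strings holding that group's text.
one_group = re.findall(r"(\d{3})-\d{3}-\d{4}", text)

# Two or more groups: a list of tuples, one item per group.
many_groups = re.findall(r"(\d{3})-(\d{3})-(\d{4})", text)

print(no_groups)    # ['415-555-1234', '212-555-9876']
print(one_group)    # ['415', '212']
print(many_groups)  # [('415', '555', '1234'), ('212', '555', '9876')]
```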

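Now here is my question: as far as I can tell, the regex in phoneRegex contains only 6 groups (numbered in the code below), so I expected each tuple to have length 6.

But when I print the tuples, I get tuples of length 9: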

('800-420-7240', '800', '-', '420', '-', '7240', '', '', '')
('415-863-9900', '415', '-', '863', '-', '9900', '', '', '')
('415-863-9950', '415', '-', '863', '-', '9950', '', '', '')

What am I missing?

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code (first group?)0
    (\s|-|\.)?                        # separator               1
    (\d{3})                           # first 3 digits          2
    (\s|-|\.)                         # separator               3
    (\d{4})                           # last 4 digits           4
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension               5
    )''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
   [a-zA-Z0-9._%+-]+      # username
   @                      # @ symbol
   [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})     # dot-something
    )''', re.VERBOSE)

text = str(pyperclip.paste())

matches = []
for groups in phoneRegex.findall(text):
    print(groups)
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])

# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

Answer 1

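Anything in parentheses becomes a capturing group (and adds one to the length of each re.findall tuple) unless you say otherwise. To turn a sub-group into a non-capturing group, add ?: immediately after the opening parenthesis: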

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                 
    (\s|-|\.)?                        
    (\d{3})                           
    (\s|-|\.)                         
    (\d{4})                              
    (\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?    # <---
    )''', re.VERBOSE)

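As you can see, the extension part contributed two extra capturing groups, and the outermost parentheses that wrap the whole pattern are a capturing group too. That is how 6 numbered groups became 9: 6 + 2 nested groups + 1 outer group. With this updated version, your tuples will have 7 items. There are 7 rather than 6 because the outer group still captures the entire matched string. Note that the indices used later in the script (groups[1], groups[3], groups[5], groups[8]) refer to the 9-item tuple, so they need to be adjusted if you change which groups capture.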
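One way to check the count without reading the pattern by eye is the `groups` attribute of a compiled pattern, which reports the number of capturing groups. The sketch below compares the original pattern with the non-capturing version, using one of the phone numbers from the question as sample input:

```python
import re

# The pattern from the question: 9 capturing groups in total.
original = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

# Same pattern with the two nested extension groups made non-capturing.
non_capturing = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?
    )''', re.VERBOSE)

print(original.groups)       # 9
print(non_capturing.groups)  # 7

sample = '800-420-7240'
print(original.findall(sample))
# [('800-420-7240', '800', '-', '420', '-', '7240', '', '', '')]
print(non_capturing.findall(sample))
# [('800-420-7240', '800', '-', '420', '-', '7240', '')]
```

Groups that did not take part in the match (here, the optional extension) show up as empty strings in the tuple.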

The regex can also be improved. This version is cleaner and, with the re.IGNORECASE flag, will match more cases:

phoneRegex = re.compile(r'''(
    (\(?\d{3}\)?)                
    ([\s.-])?                        
    (\d{3})                           
    ([\s.-])                         
    (\d{4})                              
    \s*  # don't need to capture whitespace
    ((?:ext\.?|x)\s*(?:\d{1,5}))?
    )''', re.VERBOSE | re.IGNORECASE)
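If counting tuple positions is the main source of confusion, one alternative (not part of the answer above, just a sketch) is to give the interesting groups names and iterate with finditer(), so each part is looked up by name instead of by index. The group names and the helper `normalize_phones` are made up for this example; the pattern is a lightly adapted version of the one from the question:

```python
import re

# Named groups for the parts we need; separators are non-capturing.
phone_regex = re.compile(r'''
    (?P<area>\d{3}|\(\d{3}\))?                # area code (optional)
    (?:\s|-|\.)?                              # separator
    (?P<first>\d{3})                          # first 3 digits
    (?:\s|-|\.)                               # separator
    (?P<last>\d{4})                           # last 4 digits
    (?:\s*(?:ext\.?|x)\s*(?P<ext>\d{2,5}))?   # extension (optional)
    ''', re.VERBOSE)


def normalize_phones(text):
    """Return phone numbers found in text, formatted as in the question."""
    numbers = []
    for match in phone_regex.finditer(text):
        # Skip the area code if it was not present in this match.
        parts = [p for p in (match.group('area'),
                             match.group('first'),
                             match.group('last')) if p]
        number = '-'.join(parts)
        if match.group('ext'):
            number += ' x' + match.group('ext')
        numbers.append(number)
    return numbers


print(normalize_phones('800-420-7240 and 415-863-9900 ext 12'))
# ['800-420-7240', '415-863-9900 x12']
```

With names, adding or removing a group elsewhere in the pattern no longer shifts the indices used by the rest of the script.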

findall

python-3.x

python-re