Python Regex Grouping finditer

Question

Input: 146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622

Expected output:

example_dict = {"host":"146.204.224.152", "user_name":"feest6811","time":"21/Jun/2019:15:45:24 -0700",
"request":"POST /incentivize HTTP/1.1"}

My code works for a single group on its own, for example:

for item in re.finditer('(?P<host>\d*\.\d*\.\d*.\d*)',logdata):
        print(item.groupdict())

Output: {'host': '146.204.224.152'}

But I get no output when I combine the groups into one pattern. Here is my code:

for item in re.finditer('(?P<host>\d*\.\d*\.\d*.\d*)(?P<user_name>(?<=-\s)[\w]+\d)(?P<time>(?<=\[).+(?=]))(?P<request>(?<=").+(?="))',logdata):
           print(item.groupdict())

Answer 1

I would simplify your regex pattern and just use re.findall here:

import re

inp = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
matches = re.findall(r'(\d+\.\d+\.\d+\.\d+) - (\S+) \[(.*?)\] "(.*?)"', inp)
print(matches)

This produces a list of tuples containing the four captured terms you want:

[('146.204.224.152', 'feest6811', '21/Jun/2019:15:45:24 -0700', 'POST /incentivize HTTP/1.1')]
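
If a dictionary like the one in the question is the goal, one possible variation is to name the groups and call groupdict() on the match. This is only a sketch under the assumption that every line follows the format shown above; the names `pattern` and `example_dict` are illustrative and not part of the original answer.

```python
import re

inp = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

# Same pattern as in the findall example, with named groups instead of positional ones.
pattern = re.compile(
    r'(?P<host>\d+\.\d+\.\d+\.\d+) - (?P<user_name>\S+) '
    r'\[(?P<time>.*?)\] "(?P<request>.*?)"'
)

match = pattern.search(inp)
# search() returns None when the line does not fit the pattern.
example_dict = match.groupdict() if match else {}
print(example_dict)
```

Using pattern.finditer() instead of search() would yield one such dictionary per matching line.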

Answer 2

If you paste two regexes back to back, they will only match texts that are back to back. For example, if you combine a and b, the regex ab will match the text ab, but it will not match acb.

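Your combined regex has exactly this problem: you have fused together regexes which apparently work fine individually, but the strings they match are not immediately adjacent in the input, so you have to add some padding to cover the substrings in between.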
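
A tiny illustration of this point, using the toy patterns a and b from the paragraph above (the group names `first` and `second` are made up for the example):

```python
import re

# Concatenating two patterns only matches when their matches are adjacent.
adjacent = re.search(r'ab', 'ab')
separated = re.search(r'ab', 'acb')
print(adjacent)    # a match object
print(separated)   # None: the 'c' in between is not accounted for

# Adding "padding" for the text in between lets the combined pattern match.
padded = re.search(r'(?P<first>a)[^b]*(?P<second>b)', 'acb')
print(padded.groupdict())
```

The padding sits outside the named groups, so it is consumed by the match without showing up in groupdict().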

Here is a slightly refactored version with padding added, plus some routine fixes to avoid common regex beginner errors.

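import re

# logdata is the log text from the question.
for item in re.finditer(r'''
        (?P<host>\d+\.\d+\.\d+\.\d+)  # every dot escaped
        (?:[-\s]+)                    # padding between host and user name
        (?P<user_name>\w+\d)
        (?:[^[]+\[)                   # padding up to the opening bracket
        (?P<time>[^]]+)
        (?:\][^]"]+")                 # padding up to the opening quote
        (?P<request>[^"]+)''',
        logdata, re.VERBOSE):
    print(item.groupdict())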

Demo: https://ideone.com/BsNLG7
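
As a rough sketch of how the padded pattern could be reused over several log lines, the version below compiles it once and collects one dictionary per entry. The names `LOG_LINE` and `entries` are hypothetical, and the second sample line is made up for illustration; real logs may need a looser user_name group.

```python
import re

# First line is from the question; the second line is invented sample data.
logdata = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '10.0.0.1 - alice42 [21/Jun/2019:15:45:25 -0700] "GET /index.html HTTP/1.1" 200 512\n'
)

# Compile once; re.VERBOSE ignores the whitespace and comments in the pattern.
LOG_LINE = re.compile(r'''
    (?P<host>\d+\.\d+\.\d+\.\d+)   # dotted-quad host
    [-\s]+                         # padding between host and user name
    (?P<user_name>\w+\d)           # user name ending in a digit, as in the answer
    [^[]+\[                        # padding up to and including the opening bracket
    (?P<time>[^]]+)                # everything before the closing bracket
    \][^]"]+"                      # padding up to and including the opening quote
    (?P<request>[^"]+)             # everything before the closing quote
''', re.VERBOSE)

# finditer() yields one match per log entry; groupdict() maps names to text.
entries = [m.groupdict() for m in LOG_LINE.finditer(logdata)]
for entry in entries:
    print(entry)
```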
