Python issue with printing regex group from log file

I can't print two regex groups from a log file. I don't get any errors; I just don't get any results.

I would like them to read:

12345@email.com = 19290
45678@email.com = 23625

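In this case, I only want to print the Account and High Score data from Category2. I am new to Python, but I am trying to learn more through practice. It seems my regex is not returning any matches in Python, but when I use the Regex101 tool, I get both groups with my regex code. Maybe the problem is how I print the groups. Any help would be greatly appreciated so that I can learn from my mistakes. :)

Here is my code: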

import re

log = open(r"C:\CurrentLog.txt","r")
regex = re.compile("Category2-{25}\n.{51}(?P<Account>.{11}\.com).\.\.(?:$\n^.*){5}High Score = (?P<Score>\d{2,})", re.M)

for line in log:
    data = regex.findall(line)
    for word in data:
        print (line.group(Account))
        print (line.group(Score))

Sample log file:

The actual log file will stay at around 400 - 600 lines, so I am not worried about loading it into memory.

2019-10-17 17:56:44,295 :: INFO :: root :: -------------------------Category1-------------------------
2019-10-17 17:56:49,988 :: INFO :: root :: Account 12345@email.com...
2019-10-17 17:57:09,328 :: INFO :: root :: other info 1
2019-10-17 18:00:22,267 :: INFO :: root :: other info 2
2019-10-17 18:00:22,582 :: INFO :: root :: High Score = 19090
2019-10-17 18:00:22,582 :: INFO :: root :: other info 3
2019-10-17 18:00:22,582 :: INFO :: root :: other info 4
2019-10-17 18:00:24,661 :: INFO :: root :: -------------------------Category2-------------------------
2019-10-17 18:00:29,619 :: INFO :: root :: Account 12345@email.com...
2019-10-17 18:00:46,317 :: INFO :: root :: other info 1
2019-10-17 18:05:46,088 :: INFO :: root :: other info 2
2019-10-17 18:05:52,451 :: INFO :: root :: other info 3
2019-10-17 18:08:11,765 :: INFO :: root :: other info 4
2019-10-17 18:08:12,813 :: INFO :: root :: High Score = 19290
2019-10-17 18:08:12,814 :: INFO :: root :: other info 5
2019-10-17 18:08:12,814 :: INFO :: root :: other info 6
2019-10-17 18:08:14,890 :: INFO :: root :: -------------------------Category1-------------------------
2019-10-17 18:08:19,860 :: INFO :: root :: Account 45678@email.com...
2019-10-17 18:08:37,188 :: INFO :: root :: other info 1
2019-10-17 18:13:23,232 :: INFO :: root :: other info 2
2019-10-17 18:13:23,595 :: INFO :: root :: High Score = 23425
2019-10-17 18:13:23,595 :: INFO :: root :: other info 3
2019-10-17 18:13:23,595 :: INFO :: root :: other info 4
2019-10-17 18:13:25,689 :: INFO :: root :: -------------------------Category2-------------------------
2019-10-17 18:13:30,660 :: INFO :: root :: Account 45678@email.com...
2019-10-17 18:13:47,727 :: INFO :: root :: other info 1
2019-10-17 18:16:20,327 :: INFO :: root :: other info 2
2019-10-17 18:16:26,907 :: INFO :: root :: other info 3
2019-10-17 18:18:44,376 :: INFO :: root :: other info 4
2019-10-17 18:18:45,447 :: INFO :: root :: High Score = 23625
2019-10-17 18:18:45,447 :: INFO :: root :: other info 5
2019-10-17 18:18:45,447 :: INFO :: root :: other info 6

Let me know if you need more information or context.

Thanks!

for line in log:
    data = regex.findall(line)

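What the code block above does is apply your regex to each line separately, which fails because your regex spans multiple lines. You need to run the regex against the entire contents of the file.

The code below should work: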

import re

# Read the entire content of the file into a variable
with open(r"log.txt", "r") as log_file:
    contents = log_file.read()

regex = re.compile(r"Category2-{25}\n.{51}(?P<Account>.{11}\.com).\.\.(?:$\n^.*){5}High Score = (?P<Score>\d{2,})", re.M)

# finditer is like re.findall, except that it yields match objects,
# so the named groups can be read with .group()
for match in regex.finditer(contents):
    print(match.group('Account'))
    print(match.group('Score'))
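
To see the difference concretely, here is a minimal sketch that runs a multi-line pattern first line by line and then over the whole text. The `SAMPLE` string and the shortened pattern are made up for illustration; they are not the asker's real log or regex.

```python
import re

# Hypothetical, shortened sample; the real log has more fields per line.
SAMPLE = (
    "---Category2---\n"
    "Account 12345@email.com...\n"
    "other info 1\n"
    "High Score = 19290\n"
)

# Simplified stand-in for the pattern in the question: it needs to see
# the header line, the account line and the score line together.
pattern = re.compile(
    r"Category2-+\nAccount (?P<Account>\S+?\.com)\.{3}\n"
    r"(?:.*\n)*?High Score = (?P<Score>\d+)"
)

# Line by line: no single line contains the whole pattern, so nothing matches.
per_line = [m for line in SAMPLE.splitlines(keepends=True)
            for m in pattern.finditer(line)]

# Whole text: the pattern is free to span the newlines.
whole_text = [(m.group("Account"), m.group("Score"))
              for m in pattern.finditer(SAMPLE)]

print(per_line)
print(whole_text)
```

Note also that in the question's loop `line` is a plain string, and `findall` returns tuples rather than match objects, so `line.group(Account)` would raise an error as soon as anything matched. Group names are passed as strings to a match object, as in `m.group("Account")`.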

I think you have overcomplicated the regex a bit. Try this:

import re

RE_PATTERN = re.compile(r'Account\s(?P<Account>.+?\.com).*?High Score = (?P<Score>\d+)', re.DOTALL)

# Read the entire log as one string
with open(r"C:\CurrentLog.txt", "r") as log:
    for match in RE_PATTERN.finditer(log.read()):
        print(match.group('Account'))
        print(match.group('Score'))

With re.DOTALL, . also matches \n, so .*? consumes everything up to the text High Score =
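
This pattern does not mention the category, so it pairs every Account line with the next High Score line. If only Category2 blocks are wanted, as the question asks, one possible variant anchors the match on the header. The sketch below is an assumption about how that could be done, not the answerer's code, and `SAMPLE` is a made-up, shortened log.

```python
import re

# Made-up, shortened log with one Category1 and one Category2 block.
SAMPLE = (
    "-----Category1-----\n"
    "Account 12345@email.com...\n"
    "High Score = 19090\n"
    "-----Category2-----\n"
    "Account 12345@email.com...\n"
    "other info 1\n"
    "High Score = 19290\n"
)

# Same idea as the pattern above, but the match must start at a Category2 header.
# re.DOTALL lets ".*?" run across newlines up to the nearest "High Score = ".
CATEGORY2 = re.compile(
    r"Category2-+\n.*?Account\s(?P<Account>\S+?\.com).*?High Score = (?P<Score>\d+)",
    re.DOTALL,
)

results = [(m.group("Account"), m.group("Score"))
           for m in CATEGORY2.finditer(SAMPLE)]
print(results)
```

Because `.*?` can cross block boundaries, this assumes every Category2 block contains both an Account line and a High Score line.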

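You can try a simplified version of the regex: Category2-{25}\n.+Account\s+(.+)[\s\S]+?High Score = (.+)

Account\s+(.+) - matches Account followed by one or more whitespace characters, which brings it up to the email address; it then matches everything up to the newline (the email address together with the trailing ...) and stores it in a capture group.

The other change is [\s\S]+?, which matches any character, one or more times, non-greedily, until High Score is matched. It then matches the score (after the equals sign) and stores it in the second capture group.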
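
As a rough sketch of how that pattern could be used from Python: the `SAMPLE` text below is a shortened, made-up log whose header keeps the 25 trailing dashes that `-{25}` expects. Since `(.+)` runs to the end of the line, the first group also picks up the trailing `...`, which the sketch strips with `rstrip`.

```python
import re

# Made-up, shortened log; "ts" stands in for the timestamp and logger fields.
SAMPLE = (
    "ts :: " + "-" * 25 + "Category2" + "-" * 25 + "\n"
    "ts :: Account 45678@email.com...\n"
    "ts :: other info 1\n"
    "ts :: High Score = 23625\n"
    "ts :: other info 5\n"
)

pattern = re.compile(r"Category2-{25}\n.+Account\s+(.+)[\s\S]+?High Score = (.+)")

# With two unnamed groups, findall returns a list of (group1, group2) tuples.
# Group 1 still ends in "...", so drop the trailing dots.
pairs = [(account.rstrip("."), score)
         for account, score in pattern.findall(SAMPLE)]

for account, score in pairs:
    print(account, "=", score)
```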

The code below may help you. It gives you a list of tuples containing the email and the score.

import re

with open(r"log.txt", "r") as log_file:
    log_text = log_file.read()
regex = re.compile(r"Category2-{25}\n.{51}(?P<Account>.{11}\.com).\.\.(?:$\n^.*){5}High Score = (?P<Score>\d{2,})", re.M)
print(regex.findall(log_text))

Output

[('12345@email.com', '19290'), ('45678@email.com', '23625')]
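
To get from that list to the `account = score` lines asked for in the question, the tuples can be unpacked and formatted. A minimal sketch, assuming `results` holds the list printed above:

```python
# Assumed input: the list of (account, score) tuples returned by findall.
results = [("12345@email.com", "19290"), ("45678@email.com", "23625")]

# Build one "account = score" string per tuple.
lines = [f"{account} = {score}" for account, score in results]
print("\n".join(lines))
```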