How to read a log file by time and extract specific lines that carry no date information

Question

I have a log file in the following format:

INFO    2018/11/20 18:56:00 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
INFO    2018/11/20 18:56:00 bbbbbbbbbbbbbbbbbbbbbb
INFO    2018/11/20 18:56:00 cccccccccccccccccccccccccccc
INFO    2018/11/20 18:56:00 ddddddddddddddddddddddd
WARN    2018/11/20 18:56:23 Some Error Message
java.lang.IllegalArgumentException: blahblahblah
INFO    2018/11/20 19:01:23 eeeeeeeeeeeeeeeeeeeeeeeee

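I don't care about ordinary log entries, but I want to extract the lines that contain the word "Exception" and were written within a certain time window (for example, during the hour starting at 18:00:00). My first idea was to use enumerate while reading the log file to get the line indexes. But with that approach I have to read the file at least three times. Also, linecache loads every line of the file into memory. Some of the files are over 100MB, so I know this is a bad idea.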

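import linecache
import re

start = 0
end = 0
with open("filename", "r") as f:
    # Find the first line stamped in the 18:xx hour (1-based line number).
    for idx, line in enumerate(f, 1):
        if re.search(r"2018(\/|:|)11(\/|:|)20 18:\d{2}:\d{2}", line):
            start = idx
            break

    # Keep reading from where the first loop stopped, continuing the count.
    for idx, line in enumerate(f, start + 1):
        if re.search(r"2018(\/|:|)11(\/|:|)20 19:\d{2}:\d{2}", line):
            end = idx - 1
            break

# linecache.getline uses 1-based line numbers and caches the whole file.
for i in range(start, end + 1):
    line = linecache.getline("filename", i)
    if 'Exception' in line:
        print(line)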

The most critical problem is that log entries are not always written at xx:00 or xx:59. For example, the entries for an hour may start at 18:01:00 or 18:03:31.

I have not been able to come up with a good idea since yesterday. Please help me. Thanks in advance.

Answer 1

Can you read the file line by line?

with open('test.txt', 'r') as f:
    # Iterating over the file object reads one line at a time,
    # so a large file is never held in memory all at once.
    for line in f:
        if 'Exception' in line:
            print(line, end='')

Answer 2

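You don't actually have to iterate over the file three times. Just keep track of the current line and the previous line inside the loop.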

from collections import OrderedDict
import re

result = OrderedDict()

with open("filename", "r") as f:
    prev, curr = None, None
    for idx, line in enumerate(f):
        prev = curr
        curr = line
        if re.search('Exception', line):
            # prev is None on the first line, so guard before searching it.
            if prev is not None and re.search(r'18:\d{2}:\d{2}', prev):
                result[idx] = line

print(result)

Output:

OrderedDict([(5, 'java.lang.IllegalArgumentException: blahblahblah\n')])

If you want the line numbers for every one-hour window in the log file, just replace '18' with a variable.
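One limitation of looking only at the previous line is that a multi-line stack trace would have an undated line as its predecessor. A possible variation, sketched here under assumptions, is to remember the most recent timestamp seen and compare it against an explicit window. The function name `find_in_window`, the `TIMESTAMP` pattern, and the half-open window `[start, end)` are illustrative choices, not part of the original answer, and the pattern assumes every dated line looks like the sample log. Line numbers here are 1-based, unlike the 0-based index above.

```python
import re
from datetime import datetime

# Assumed layout: "LEVEL   YYYY/MM/DD HH:MM:SS message", as in the sample log.
TIMESTAMP = re.compile(r"^\w+\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def find_in_window(lines, start, end, keyword="Exception"):
    """Yield (line_number, line) for lines containing keyword whose most
    recently seen timestamp falls in the half-open window [start, end)."""
    last_seen = None  # timestamp of the latest dated line, if any
    for number, line in enumerate(lines, 1):
        match = TIMESTAMP.match(line)
        if match:
            last_seen = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if keyword in line and last_seen is not None and start <= last_seen < end:
            yield number, line.rstrip("\n")


# The sample log from the question, used here instead of a real file.
sample = [
    "INFO    2018/11/20 18:56:00 aaaaaaaaaaaaaaaaaaaaaaaaaaaa\n",
    "INFO    2018/11/20 18:56:00 bbbbbbbbbbbbbbbbbbbbbb\n",
    "INFO    2018/11/20 18:56:00 cccccccccccccccccccccccccccc\n",
    "INFO    2018/11/20 18:56:00 ddddddddddddddddddddddd\n",
    "WARN    2018/11/20 18:56:23 Some Error Message\n",
    "java.lang.IllegalArgumentException: blahblahblah\n",
    "INFO    2018/11/20 19:01:23 eeeeeeeeeeeeeeeeeeeeeeeee\n",
]

window_start = datetime(2018, 11, 20, 18, 0, 0)
window_end = datetime(2018, 11, 20, 19, 0, 0)
hits = list(find_in_window(sample, window_start, window_end))
print(hits)  # [(6, 'java.lang.IllegalArgumentException: blahblahblah')]
```

Because the function is a generator that accepts any iterable of lines, an open file object can be passed in place of `sample`, so the file is read once and only one line is held in memory at a time.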


python

algorithm

logging

readfile