Parsing logs - how to read part of a line

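I'm trying to write something to parse and report on a very specific part of a very large and verbose log file.

Basically, the structure can be described as: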

Stuff I don't care about
Stuff I don't care about
Stuff I don't care about
More stuff I don't care about
DEBUG     2015-03-13 01:20:03  transfer.py:200  New transfer candidates: set([''])
Stuff I don't care about
Stuff I don't care about
Stuff I don't care about
More stuff I don't care about
DEBUG     2015-03-13 01:20:03  transfer.py:200  New transfer candidates: set(['foo/bar'])
Lots more stuff I don't care about
Even more stuff I don't care about
Still more stuff I don't care about
INFO      2015-03-13 09:00:01  transfer.py:363  Status info: {u'status': u'COMPLETE', u'name': u'bar', u'path': u'irrelevant content', u'directory': u'irrelevant content', u'microservice': u'Remove the processing directory', u'message': u'Fetched status for 67646105-2c08-47ec-93d1-b7d3f3b43d13 successfully.', u'type': u'SIP', u'uuid': u'67646105-2c08-47ec-93d1-b7d3f3b43d13'}

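What I want to do is read the file line by line and find every instance of New transfer candidates where the contents of set(['']) are not empty. In that case, I want to grab the string (here 'foo/bar') and put it in a variable. I also want to store that line's timestamp in a variable.

As I continue reading line by line, I also want to look for lines containing Status info: {u'status': u'COMPLETE', and then take the "name" (i.e. u'name': u'bar') and put it in a variable (here 'bar'). As above, I want to store the timestamp in a variable as well.

The goal here is basically to see when a transfer started and when it finished. I've written some laughably basic code: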

#!/usr/bin/env python

import argparse

parser = argparse.ArgumentParser(description=
    "Python tool for generating performance statistics from Archivematica's "
    "Automation-tools log file")
# argparse.FileType opens the named file for reading (works on Python 2 and 3)
parser.add_argument('-i', '--input', type=argparse.FileType('r'),
                    help='log file to read')
args = parser.parse_args()
if not (args.input):
    parser.error('you did not specify a log file')

log = args.input
# count the lines that mention new transfer candidates
x = 0
for line in log:
    if 'New transfer candidates' in line:
        x = x+1
print(x)

My problem is that I'm not sure how to pick out the strings I'm looking for from within those lines.

Use the re module in the standard library or the open-source pyparsing module.

The following example shows how to use re to parse the lines that contain the set data.

#!/usr/bin/env python

import argparse
import re

parser = argparse.ArgumentParser(description="Python tool for generating performance statistics from Archivematica's Automation-tools log file")
# argparse.FileType opens the named file for reading (works on Python 2 and 3)
parser.add_argument('-i', '--input', type=argparse.FileType('r'), help='log file to read')
args = parser.parse_args()

if not (args.input):
    parser.error('you did not specify a log file')

log = args.input

x = 0
# raw string so the backslashes reach the regex engine unchanged;
# (.+) needs at least one character, so the empty set(['']) is skipped
regex1 = re.compile(r"New transfer candidates: set\(\['(.+)'\]\)")
for line in log:
    if 'New transfer candidates' in line:
        m = regex1.search(line)
        if m:
            print(m.group(1))
        x = x+1
print(x)
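
To capture the timestamp from the same line as well, the pattern can be extended with named groups. The following is only a minimal sketch for Python 3, assuming every line of interest follows the layout shown in the question (level, date, time, source location, message). The names CANDIDATE_RE and parse_candidates are made up for illustration.

```python
import re
from datetime import datetime

# Level, timestamp, source location, then a non-empty candidate list.
CANDIDATE_RE = re.compile(
    r"^(?P<level>\S+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"\S+\s+"
    r"New transfer candidates: set\(\[(?P<items>'.+')\]\)"
)


def parse_candidates(line):
    """Return (timestamp, [candidates]) or None if the line does not match."""
    m = CANDIDATE_RE.search(line)
    if m is None:
        return None
    timestamp = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    # Pull out each quoted entry; empty strings are dropped.
    names = [n for n in re.findall(r"'([^']*)'", m.group("items")) if n]
    if not names:
        return None
    return timestamp, names
```

Calling parse_candidates on each line of the log returns None for everything except the non-empty candidate lines, so the caller can simply skip the None results.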

This should get you started:

import time
import re
import ast

with open('input.txt') as logfile:
    for line in logfile:
        line = line.strip()
        # search for level and timestamp
        match = re.match(r'(\S+)\s+(\S{10} \S{8})\s*(\S.*)$', line)
        if match:
            level = match.group(1)
            timestr = match.group(2)
            # seconds since the epoch, interpreted in local time
            timestamp = time.mktime(time.strptime(timestr, '%Y-%m-%d %H:%M:%S'))
            message = match.group(3)

            # transfer candidates: the text inside set(...) is a list literal
            match = re.match(r'.*New transfer candidates: set\((.*)\)', message)
            if match:
                candidates = ast.literal_eval(match.group(1))
                print('New transfer candidate: {}'.format(candidates))
                continue

            # status info: the text after the colon is a dict literal
            match = re.match(r'.*Status info: (.*)$', message)
            if match:
                info = ast.literal_eval(match.group(1))
                print('Status info: {}'.format(info))
                continue

            print('Unrecognized message.')
        else:
            print('Unrecognized line.')
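
Since the stated goal is to see when a transfer started and when it finished, the two kinds of events can be paired up. The sketch below (Python 3) is one possible way to do that. It assumes that the last path component of a candidate (for example 'bar' in 'foo/bar') equals the name reported in the COMPLETE status line; that matching rule is a guess based on the sample log, not something the log format guarantees. It also assumes the payloads are Python literals exactly as in the sample. The function name transfer_durations is hypothetical.

```python
import ast
import re
from datetime import datetime

# Level, timestamp, source location (ignored), then the message.
LINE_RE = re.compile(r"^(\S+)\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\S+\s+(.*)$")
CANDIDATES_PREFIX = "New transfer candidates: set("
STATUS_PREFIX = "Status info: "


def transfer_durations(lines):
    """Map transfer name -> (start, end) datetimes for completed transfers."""
    started = {}
    finished = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a timestamped log line
        when = datetime.strptime(m.group(2), "%Y-%m-%d %H:%M:%S")
        message = m.group(3)
        if message.startswith(CANDIDATES_PREFIX):
            # Drop the prefix and the closing parenthesis, leaving a list literal.
            payload = message[len(CANDIDATES_PREFIX):-1]
            for path in ast.literal_eval(payload):
                if path:
                    # Assumption: the last path component is the transfer name.
                    name = path.rstrip("/").split("/")[-1]
                    started.setdefault(name, when)  # keep the first sighting
        elif message.startswith(STATUS_PREFIX):
            info = ast.literal_eval(message[len(STATUS_PREFIX):])
            name = info.get("name")
            if info.get("status") == "COMPLETE" and name in started:
                finished[name] = (started[name], when)
    return finished
```

With the result in hand, end - start gives a datetime.timedelta for each transfer, which can be printed directly or converted with total_seconds().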