Parsing logs: how to read part of a line
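I'm trying to write something to parse and report on a very specific part of a very large and verbose log file.
Basically, the structure can be described as: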
Stuff I don't care about
Stuff I don't care about
Stuff I don't care about
More stuff I don't care about
DEBUG 2015-03-13 01:20:03 transfer.py:200 New transfer candidates: set([''])
Stuff I don't care about
Stuff I don't care about
Stuff I don't care about
More stuff I don't care about
DEBUG 2015-03-13 01:20:03 transfer.py:200 New transfer candidates: set(['foo/bar'])
Lots more stuff I don't care about
Even more stuff I don't care about
Still more stuff I don't care about
INFO 2015-03-13 09:00:01 transfer.py:363 Status info: {u'status': u'COMPLETE', u'name': u'bar', u'path': u'irrelevant content', u'directory': u'irrelevant content', u'microservice': u'Remove the processing directory', u'message': u'Fetched status for 67646105-2c08-47ec-93d1-b7d3f3b43d13 successfully.', u'type': u'SIP', u'uuid': u'67646105-2c08-47ec-93d1-b7d3f3b43d13'}
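What I want to do is read the file line by line and find any instance of New transfer candidates where the contents of set(['']) are not empty. In that case, I want to grab the string (in this example 'foo/bar') and put it in a variable. I also want to store that line's timestamp in a variable.
As I continue reading line by line, I also want to look for lines containing Status info: {u'status': u'COMPLETE, then take the "name" entry (i.e. u'name': u'bar') and put its value in a variable (in this example 'bar'). As above, I want to put the timestamp in a variable too.
The goal here is mainly to see when a transfer starts and when it finishes. So far I have written something laughably basic: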
#!/usr/bin/env python
import argparse

parser = argparse.ArgumentParser(description=
    "Python tool for generating performance statistics from Archivematica's "
    "Automation-tools log file")
# argparse.FileType('r') opens the named file for reading
parser.add_argument('-i', '--input', type=argparse.FileType('r'), help='log file to read')
args = parser.parse_args()
if not args.input:
    parser.error('you did not specify a log file')
log = args.input

# Count the lines that mention new transfer candidates
x = 0
for line in log:
    if 'New transfer candidates' in line:
        x = x + 1
print(x)
My problem is that I'm not quite sure how to find the strings I'm looking for within the various parts of these lines.
Use the re module in the standard library or the open-source pyparsing module.
The following example shows how to use re to parse the lines that contain the set data.
#!/usr/bin/env python
import argparse
import re

parser = argparse.ArgumentParser(description="Python tool for generating performance statistics from Archivematica's Automation-tools log file")
parser.add_argument('-i', '--input', type=argparse.FileType('r'), help='log file to read')
args = parser.parse_args()
if not args.input:
    parser.error('you did not specify a log file')
log = args.input

x = 0
# Capture the quoted text inside set([...]); (.+) requires at least one
# character, so the empty set(['']) does not match.
regex1 = re.compile(r"New transfer candidates: set\(\['(.+)'\]\)")
for line in log:
    if 'New transfer candidates' in line:
        m = regex1.search(line)
        if m:
            print(m.group(1))
        x = x + 1
print(x)
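The same idea can be stretched to pick up the timestamp as well. The following is only a minimal sketch: the names CANDIDATE_RE and parse_candidate are made up for illustration, and it assumes every log line of interest starts with a level followed by a "YYYY-MM-DD HH:MM:SS" timestamp, as in the sample above.

```python
import re
from datetime import datetime

# Hypothetical pattern: log level, timestamp, then the quoted candidate
# inside set(['...']). (.+) needs at least one character, so set([''])
# is skipped.
CANDIDATE_RE = re.compile(
    r"^\S+ (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"New transfer candidates: set\(\['(.+)'\]\)"
)


def parse_candidate(line):
    """Return (timestamp, candidate) for a non-empty candidates line, else None."""
    m = CANDIDATE_RE.search(line)
    if m is None:
        return None
    started = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return started, m.group(2)
```

Note that if a set ever holds more than one candidate, group 2 would contain the raw text between the first and last quote, so that case would need extra handling.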
This should get you started:
import time
import re
import ast

with open('input.txt') as logfile:
    for line in logfile:
        line = line.strip()
        # search for level and timestamp
        match = re.match(r'(\S+)\s+(\S{10} \S{8})\s*(\S.*)$', line)
        if match:
            level = match.group(1)
            timestr = match.group(2)
            timestamp = time.mktime(time.strptime(timestr, '%Y-%m-%d %H:%M:%S'))
            message = match.group(3)
            # transfer candidates: evaluate the list literal inside set(...)
            match = re.match(r'.*New transfer candidates: set\((.*)\)', message)
            if match:
                candidates = ast.literal_eval(match.group(1))
                print('New transfer candidate:', candidates)
                continue
            # status info: evaluate the dict literal that follows the label
            match = re.match(r'.*Status info: (.*)$', message)
            if match:
                info = ast.literal_eval(match.group(1))
                print('Status info:', info)
                continue
            print('Unrecognized message.')
        else:
            print('Unrecognized line.')
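To get closer to what the question asks for (non-empty candidates, COMPLETE statuses, and their timestamps), the loop above could be wrapped up roughly as follows. This is a sketch under assumptions, not a finished tool: transfer_events and the three pattern names are invented here, and it assumes the candidates always appear in the set([...]) form shown in the sample log.

```python
import ast
import re
from datetime import datetime

# Hypothetical patterns, modelled on the sample log lines in the question.
LINE_RE = re.compile(r"^(\S+)\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(.*)$")
CANDIDATES_RE = re.compile(r"New transfer candidates: set\((.*)\)")
STATUS_RE = re.compile(r"Status info: (\{.*\})")


def transfer_events(lines):
    """Yield ('start' | 'complete', timestamp, name) tuples from log lines."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a "LEVEL timestamp message" line
        when = datetime.strptime(m.group(2), "%Y-%m-%d %H:%M:%S")
        message = m.group(3)

        m = CANDIDATES_RE.search(message)
        if m:
            # The text inside set(...) is a list literal such as ['foo/bar']
            for candidate in ast.literal_eval(m.group(1)):
                if candidate:  # skip the empty string from set([''])
                    yield ("start", when, candidate)
            continue

        m = STATUS_RE.search(message)
        if m:
            info = ast.literal_eval(m.group(1))
            if info.get("status") == "COMPLETE":
                yield ("complete", when, info.get("name"))
```

It could be used as `for kind, when, name in transfer_events(open('input.txt')): ...`. Matching a 'start' event to its 'complete' event is left out on purpose: the sample suggests the name ('bar') is the last component of the candidate path ('foo/bar'), but that is a guess from one example.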