Extract critical numbers from a mixed log file
I have a log file that contains many segments like this:
Align set A and merge into set B ...
setA, 4 images , image size 146 X 131
setA, image 1, shape center shift (7, -9) compared to image center
setA, image 2, shape center shift (8, -10) compared to image center
setA, image 3, shape center shift (6, -9) compared to image center
setA, image 4, shape center shift (6, -8) compared to image center
final set B, image size 143 X 129
Write set B ...
Now I want to extract the numbers from this segment into a table:
| | width_A | height_A | shift_x | shift_y | width_B | height_B |
| --- | --- | --- | --- | --- | --- | --- |
| A1 | 146 | 131 | 7 | -9 | 143 | 129 |
| A2 | 146 | 131 | 8 | -10 | 143 | 129 |
| A3 | 146 | 131 | 6 | -9 | 143 | 129 |
| A4 | 146 | 131 | 6 | -8 | 143 | 129 |
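If I split the program into two parts, they would be:
- Text processing: read the text into a dictionary data, e.g. data['A1']['shift_x'] = 7.
- Use pandas to convert the dictionary into a DataFrame: df = pd.DataFrame(data)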
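A minimal sketch of the second step, assuming the first step produced a nested dictionary keyed by image id (the dictionary below is written by hand from the excerpt above, not parsed). One detail worth knowing: `pd.DataFrame(data)` treats the outer keys as columns, so `A1`, `A2` would become column labels; `pd.DataFrame.from_dict(data, orient="index")` turns them into rows, which is the layout of the table above.

```python
import pandas as pd

# Hand-written stand-in for the result of the text-processing step:
# outer keys are image ids, inner keys are the table columns.
data = {
    "A1": {"width_A": 146, "height_A": 131, "shift_x": 7, "shift_y": -9,
           "width_B": 143, "height_B": 129},
    "A2": {"width_A": 146, "height_A": 131, "shift_x": 8, "shift_y": -10,
           "width_B": 143, "height_B": 129},
}

# orient="index" makes each outer key a row instead of a column.
df = pd.DataFrame.from_dict(data, orient="index")
print(df)
```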
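But I am not familiar with text processing in Python:
- Unlike "Python: How to loop through blocks of lines", my log text is not very well organized;
- Regular expressions might be an option, but I can never remember what all the symbols mean.
Does anyone have a good solution for this? Python is preferred. Thanks in advance.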
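On the regex symbols: only a handful are needed for lines like these. The sketch below applies them to one line copied from the excerpt; the name `shift_re` is just an illustration.

```python
import re

# One line copied from the log excerpt in the question.
line = "setA, image 2, shape center shift (8, -10) compared to image center"

# \d+     one or more digits
# -?      an optional minus sign
# \( \)   literal parentheses; unescaped ( ) create a capture group
shift_re = re.compile(r"setA, image (\d+), shape center shift \((-?\d+), (-?\d+)\)")

m = shift_re.search(line)
# groups() returns the captured strings in order; convert them to int.
image_no, shift_x, shift_y = (int(g) for g in m.groups())
print(image_no, shift_x, shift_y)  # 2 8 -10
```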
I finally found an answer myself:
import re

log_file = 'example.log'  # placeholder path; point this at your own log file

# Map a tuple of attribute names to the pattern whose groups capture them.
# match() anchors at the start of the line; \s* tolerates leading indentation.
# As written, 'title' and 'gauge_no' do not match the shortened excerpt in
# the question; adjust them to the wording of your log.
regexp = {
    ('title',): re.compile(r'Merge (.*) into set B'),
    # The question's table reads "146 X 131" as width_A, height_A.
    ('nimages', 'width_A', 'height_A'): re.compile(
        r'\s*setA, (\d+) images , image size (\d+) X (\d+)'),
    ('image_no', 'shift_x', 'shift_y'): re.compile(
        r'\s*setA, image (\d+), shape center shift \((-?\d+), (-?\d+)\) compared to image center'),
    ('gauge_no',): re.compile(r'Write gauge (\d+), set B'),
}

with open(log_file) as f:
    for line in f:
        print(line.rstrip())
        for keys, pattern in regexp.items():
            m = pattern.match(line)
            if m:
                # m.group(0) is the whole match, so attributes start at group 1
                for groupn, attr in enumerate(keys):
                    print(groupn, attr, m.group(groupn + 1))
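The code above only prints what it matches. A possible next step, sketched here under assumptions, is to collect one row per image in the layout of the table from the question. The `final set B` pattern and the function name `parse_segments` are not part of the answer; they are written against the excerpt only. Ids such as `A1` will repeat if the log has several segments, so a real log would likely need a segment counter as well.

```python
import re

# Patterns written against the excerpt in the question; search() is used so
# leading indentation in a real log does not matter.
SIZE_A = re.compile(r"setA, (\d+) images , image size (\d+) X (\d+)")
SHIFT = re.compile(r"setA, image (\d+), shape center shift \((-?\d+), (-?\d+)\)")
SIZE_B = re.compile(r"final set B, image size (\d+) X (\d+)")


def parse_segments(lines):
    """Return one dict per image, using the column names of the question's table."""
    rows = []      # finished rows from all segments
    pending = []   # rows of the current segment, still missing set B's size
    width_a = height_a = None
    for line in lines:
        m = SIZE_A.search(line)
        if m:
            # A new segment starts: remember set A's image size.
            width_a, height_a = int(m.group(2)), int(m.group(3))
            pending = []
            continue
        m = SHIFT.search(line)
        if m:
            pending.append({
                "id": "A" + m.group(1),
                "width_A": width_a, "height_A": height_a,
                "shift_x": int(m.group(2)), "shift_y": int(m.group(3)),
            })
            continue
        m = SIZE_B.search(line)
        if m:
            # Set B's size closes the segment and applies to all of its rows.
            for row in pending:
                row["width_B"], row["height_B"] = int(m.group(1)), int(m.group(2))
            rows.extend(pending)
            pending = []
    return rows


sample = """\
Align set A and merge into set B ...
setA, 4 images , image size 146 X 131
setA, image 1, shape center shift (7, -9) compared to image center
setA, image 2, shape center shift (8, -10) compared to image center
setA, image 3, shape center shift (6, -9) compared to image center
setA, image 4, shape center shift (6, -8) compared to image center
final set B, image size 143 X 129
Write set B ...
"""

rows = parse_segments(sample.splitlines())
for row in rows:
    print(row)
```

If pandas is available, `pd.DataFrame(rows).set_index("id")` should then give a frame shaped like the requested table.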
References
- A question I had not noticed before asking: Extracting info from large structured text files
- Regular expression cheat table