Parse a multi-line header in a space-formatted report with pyparsing
I am trying to parse a file that contains a table with a multi-line header:
                        Categ_1   Categ_2   Categ_3    Categ_4
data1 Group             Data      Data      Data       Data      ( %)     Options
--------------------------------------------------------------------------------
param_group1            6.366e-03 6.644e-03 6.943e-05     0.0131 (57.42%) i
param_group2            1.251e-05 7.253e-06 4.256e-04  4.454e-04 ( 1.96%)
param_group3            2.205e-05 6.421e-05 2.352e-03  2.438e-03 (10.70%)
param_group4            1.579e-07 0.0000    1.479e-05  1.495e-05 ( 0.07%)
param_group5            3.985e-03 2.270e-07 2.789e-03  6.775e-03 (29.74%)
param_group6            0.0000    0.0000    0.0000        0.0000 ( 0.00%)
param_group7            -8.121e-09
                                  0.0000    1.896e-08  1.084e-08 ( 0.00%)
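I have successfully used pyparsing in the past to parse a table like this, but in that case the header was on a single line and none of the header fields contained embedded spaces, as the ( %) field does here.
This is how I did it: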
from pyparsing import Optional, col

def mustMatchCols(startloc, endloc):
    # parse-time condition: the token must start within the expected column range
    return lambda s, l, t: startloc <= col(l, s) <= endloc + 1

def tableValue(expr, colstart, colend):
    return Optional(expr.copy().addCondition(mustMatchCols(colstart, colend), message="text not in expected columns"))

if header:
    column_lengths = determine_header_column_widths(header_line)
    # Then run the tableValue function for each start,end pair.
Are there any built-in constructs or examples, in pyparsing or any other approach, for parsing space-formatted tables like this?
If you can pre-determine your column widths, then here is code to stitch the multi-line column headers together:
headers = """\
                        Categ_1   Categ_2   Categ_3    Categ_4
data1 Group             Data      Data      Data       Data      ( %)     Options
"""
col_widths = [24, 10, 10, 11, 9, 10, 10]

# convert widths to slices
col_slices = []
prev = 0
for cw in col_widths:
    col_slices.append(slice(prev, prev + cw))
    prev += cw

# verify slices
# for line in headers.splitlines():
#     for slc in col_slices:
#         print(line[slc])

def extract_line_parts(slices, line_string):
    return [line_string[slc].strip() for slc in slices]

# extract the different column header parts
parts = [extract_line_parts(col_slices, line) for line in headers.splitlines()]
for p in parts:
    print(p)

# use zip(*parts) to transpose list of row parts into list of column parts
header_cols = list(zip(*parts))
print(header_cols)

for header in header_cols:
    print(' '.join(filter(None, header)))
Prints:
['', 'Categ_1', 'Categ_2', 'Categ_3', 'Categ_4', '', '']
['data1 Group', 'Data', 'Data', 'Data', 'Data', '( %)', 'Options']
[('', 'data1 Group'), ('Categ_1', 'Data'), ('Categ_2', 'Data'), ('Categ_3', 'Data'), ('Categ_4', 'Data'), ('', '( %)'), ('', 'Options')]
data1 Group
Categ_1 Data
Categ_2 Data
Categ_3 Data
Categ_4 Data
( %)
Options
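Once the header names are stitched together, the same slices can probably be reused on the data rows. The following is only a minimal sketch and not something pyparsing provides: `parse_rows`, `to_number`, and the `report_body` sample are hypothetical names introduced for illustration, and the sample rows are assumed to be laid out to match `col_widths`. It skips the dashed separator line and does not try to handle rows that wrap onto a second line, such as param_group7 in the question.

```python
col_widths = [24, 10, 10, 11, 9, 10, 10]

# Same width-to-slice conversion as in the header code.
col_slices = []
prev = 0
for cw in col_widths:
    col_slices.append(slice(prev, prev + cw))
    prev += cw

def extract_line_parts(slices, line_string):
    return [line_string[slc].strip() for slc in slices]

# Column names as produced by the header-stitching step.
column_names = ["data1 Group", "Categ_1 Data", "Categ_2 Data",
                "Categ_3 Data", "Categ_4 Data", "( %)", "Options"]

# Hypothetical sample rows, assumed to follow the col_widths layout.
report_body = """\
--------------------------------------------------------------------------------
param_group1            6.366e-03 6.644e-03 6.943e-05     0.0131 (57.42%) i
param_group2            1.251e-05 7.253e-06 4.256e-04  4.454e-04 ( 1.96%)
param_group6            0.0000    0.0000    0.0000        0.0000 ( 0.00%)
"""

def to_number(text):
    """Convert a cell to float if possible; otherwise return it unchanged."""
    try:
        return float(text)
    except ValueError:
        return text

def parse_rows(body, slices, names):
    rows = []
    for line in body.splitlines():
        stripped = line.strip()
        # Skip blank lines and the dashed separator under the header.
        if not stripped or set(stripped) == {"-"}:
            continue
        cells = extract_line_parts(slices, line)
        rows.append({name: to_number(cell) for name, cell in zip(names, cells)})
    return rows

rows = parse_rows(report_body, col_slices, column_names)
for row in rows:
    print(row)
```

Each row comes back as a dict keyed by the joined header names. Cells that do not parse as a float, such as the percentage and the Options flag, are left as strings, and a missing trailing column becomes an empty string.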