How to pass portions of text to a tuple/list in Python based on a word
I have the following sample text and need to pass all the lines of text to a tuple/list, split on the phrase "ALL Banks Report". The raw text is as follows:
%Bank PARSED MESSAGE FILE
%VERSION : PIL 98.7
%nex MODULE : SIL 98
2018 Jan 31 16:44:53.050 ALL Banks Report SBI
name id ID = 0, ID = 58
Freq = 180
conserved NEXT:
message c1 : ABC1 :
{
XYZ2
}
2018 Jan 31 16:44:43.050 ALL Banks Report HDFC
conserved LATE:
World ::=
{
Asia c1 : EastAsia :
{
India
}
}
... and so on, with many repetitions like this.
I want to build a tuple/list/array split on the phrase "ALL Banks Report", so that list[0] contains the following:
2018 Jan 31 16:44:53.050 ALL Banks Report SBI
name id ID = 0, ID = 58
Freq = 180
conserved NEXT:
message c1 : ABC1 :
{
XYZ2
}
and list[1] contains the rest, as shown below:
2018 Jan 31 16:44:43.050 ALL Banks Report HDFC
conserved LATE:
World ::=
{
Asia c1 : EastAsia :
{
India
}
}
IMO, there is no particular advantage to using pyparsing here. This file is easy to process with an old-fashioned algorithm.
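output_list = []   # one list of lines per report
items = []         # lines of the report currently being collected
with open('spark.txt') as spark:
    for line in spark:
        line = line.rstrip()
        # ignore blank lines and the '%' preamble lines
        if line and not line.startswith('%'):
            if 'ALL Banks Report' in line:
                # a header starts a new report; store the previous one as a single element
                if items:
                    output_list.append(items)
                items = [line]
            else:
                items.append(line)
# store the final report
if items:
    output_list.append(items)
for item in output_list:
    print('\n'.join(item))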
Output:
2018 Jan 31 16:44:53.050 ALL Banks Report SBI
name id ID = 0, ID = 58
Freq = 180
conserved NEXT:
message c1 : ABC1 :
{
XYZ2
}
2018 Jan 31 16:44:43.050 ALL Banks Report HDFC
conserved LATE:
World ::=
{
Asia c1 : EastAsia :
{
India
}
}
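By the way, I avoid using list as an identifier because it would shadow Python's built-in list type (it is a built-in name, not a reserved keyword).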
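If it helps, the same loop can be wrapped in a function that works on any iterable of lines and returns one tuple per report, so that indexing gives the blocks asked for in the question. This is only a sketch: the name split_reports, the marker parameter, and the choice to drop lines that appear before the first header are my assumptions, not part of the original answer.

```python
MARKER = "ALL Banks Report"  # header text taken from the question


def split_reports(lines, marker=MARKER):
    """Group lines into one tuple per report; each report starts at a header line."""
    reports = []
    current = None  # None until the first header has been seen
    for raw in lines:
        line = raw.rstrip()
        # Skip blank lines and the '%' preamble lines.
        if not line or line.startswith("%"):
            continue
        if marker in line:
            # A header closes the previous report (if any) and opens a new one.
            if current is not None:
                reports.append(tuple(current))
            current = [line]
        elif current is not None:
            current.append(line)
        # Assumption: lines seen before any header are silently dropped.
    if current is not None:
        reports.append(tuple(current))
    return reports


# Shortened sample in the shape of the question's data.
sample = """\
%Bank PARSED MESSAGE FILE
2018 Jan 31 16:44:53.050 ALL Banks Report SBI
Freq = 180
2018 Jan 31 16:44:43.050 ALL Banks Report HDFC
conserved LATE:
"""

reports = split_reports(sample.splitlines())
print(reports[0])  # lines of the first report
print(reports[1])  # lines of the second report
```

Because the function accepts any iterable of strings, an open file object can be passed directly instead of sample.splitlines().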
I'm a big fan of itertools.groupby, and here is an unconventional way to use it to find your groups of bank lines:
from itertools import groupby
is_header = lambda s: "ALL Banks Report" in s
# `sample` is assumed to hold the whole input text as a single string
lines = sample.splitlines()
# call groupby to group lines by whether or not the line is a header
group_iter = groupby(lines, key=is_header)
# skip over leading group of non-header lines if the first line is not a header
if not is_header(lines[0]):
    next(group_iter)
groups = []
while True:
    head_lines = next(group_iter, None)
    # no more lines? we're done
    if head_lines is None:
        break
    # extract header lines, which is required before trying to advance the groupby iter
    head_lines = list(head_lines[1])
    # if there were multiple header lines in a row, with no bodies, create group items for them
    while len(head_lines) > 1:
        groups.append([head_lines.pop(0)])
    # get next set of lines which are NOT header lines
    body_lines = next(group_iter, (None, []))
    # extract body lines, which is required before trying to advance the groupby iter
    body_lines = list(body_lines[1])
    # we've found a head line and a body, save it as a single list
    groups.append(head_lines + body_lines)
# what did we get?
for group in groups:
    print('--------------')
    print('\n'.join(group))
    print('')
With your dataset, this gives:
--------------
2018 Jan 31 16:44:53.050 ALL Banks Report SBI
name id ID = 0, ID = 58
Freq = 180
conserved NEXT:
message c1 : ABC1 :
{
XYZ2
}
--------------
2018 Jan 31 16:44:43.050 ALL Banks Report HDFC
conserved LATE:
World ::=
{
Asia c1 : EastAsia :
{
India
}
}
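A possible variation on the groupby idea, in case the alternating header/body bookkeeping feels heavy: number every line by how many headers have been seen so far, then group on that number. This is a different technique from the loop above, offered only as a sketch; the function name group_by_header_count is hypothetical, and dropping blank lines and the preamble before the first header is my assumption about what is wanted.

```python
from itertools import accumulate, groupby

MARKER = "ALL Banks Report"  # header text taken from the question


def group_by_header_count(lines, marker=MARKER):
    """Return one list of lines per report, keyed by a running header count."""
    # Assumption: blank lines carry no information and can be dropped.
    lines = [ln.rstrip() for ln in lines if ln.strip()]
    # counts[i] is the number of header lines seen up to and including lines[i],
    # so every line of the same report shares the same count.
    counts = list(accumulate(int(marker in ln) for ln in lines))
    groups = []
    for count, pairs in groupby(zip(counts, lines), key=lambda pair: pair[0]):
        if count == 0:
            continue  # preamble before the first header (e.g. the '%' lines)
        groups.append([ln for _, ln in pairs])
    return groups


# Shortened sample in the shape of the question's data.
sample = """\
%Bank PARSED MESSAGE FILE
2018 Jan 31 16:44:53.050 ALL Banks Report SBI
Freq = 180
2018 Jan 31 16:44:43.050 ALL Banks Report HDFC
conserved LATE:
"""

for group in group_by_header_count(sample.splitlines()):
    print('--------------')
    print('\n'.join(group))
```

Since each header bumps the count, consecutive headers with no body end up as separate one-line groups without any special handling.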