How to only read lines in a text file after a certain string?
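I want to read all lines after a specific string in a text file into a dictionary, and I want to do this for thousands of text files.
I am able to identify and print the specific string ('Abstract') using the following code (obtained from this answer):
    for files in filepath:
        with open(files, 'r') as f:
            for line in f:
                if 'Abstract' in line:
                    print(line)
But how do I tell Python to start reading lines only after that string?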
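Use a boolean to ignore the lines up to that point:
    for files in filepath:
        found_abstract = False  # reset the flag for every file
        with open(files, 'r') as f:
            for line in f:
                if 'Abstract' in line:
                    found_abstract = True
                if found_abstract:
                    pass  # do whatever you want (this also sees the 'Abstract' line itself)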
To clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can set a boolean flag that indicates whether lines should be ignored, and check it on every line.
    pay_attention = False
    for line in f:
        if pay_attention:
            print(line)
        else:  # We haven't found our trigger yet; see if it's in this line
            if 'Abstract' in line:
                pay_attention = True
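If you don't mind rearranging your code a little, you can also use two partial loops instead: one that terminates once you've found the trigger phrase ('Abstract'), and one that reads all the following lines. This approach is a little cleaner (and also very fast).
    for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
        if 'Abstract' in skippable_line:
            break
    for line in f:  # The file's iterator starts up again right where we left it.
        print(line)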
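The reason this works is that the file object returned by open behaves like a generator rather than a list: it only produces values as they are requested. So when the first loop stops, the file's internal position is left at the beginning of the first "unread" line. This means that when you enter the second loop, the first line you see is the first line after the one that triggered the break.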
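To see the resuming behavior in isolation, here is a minimal sketch. It assumes io.StringIO as a stand-in for a real file (it is iterated line by line in the same way), and the sample text is made up for illustration.

```python
import io

# io.StringIO stands in for an opened text file; the content is invented sample data.
f = io.StringIO("Title\nAbstract\nfirst body line\nsecond body line\n")

for skippable_line in f:  # consume lines up to and including the trigger
    if 'Abstract' in skippable_line:
        break

# The second pass picks up where the first loop stopped.
remaining = [line.rstrip('\n') for line in f]
print(remaining)  # ['first body line', 'second body line']
```

If the trigger never appears, the first loop simply exhausts the file and the second pass sees nothing.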
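When you reach the line you want to start from, just start another loop:
    for files in filepath:
        with open(files, 'r') as f:
            for line in f:
                if 'Abstract' in line:
                    for line in f:  # now you are at the lines you want
                        pass  # do work
The file object is its own iterator, so when we reach the line containing 'Abstract', the inner loop continues iterating from that point until the iterator is exhausted.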
A simple example:
    gen = (n for n in range(8))  # range in Python 3; the original used xrange
    for x in gen:
        if x == 3:
            print('Starting second loop')
            for x in gen:
                print('In second loop', x)
        else:
            print('In first loop', x)
Produces:
In first loop 0
In first loop 1
In first loop 2
Starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7
You can also use itertools.dropwhile to consume the lines up to the point you want:
    from itertools import dropwhile
    for files in filepath:
        with open(files, 'r') as f:
            dropped = dropwhile(lambda _line: 'Abstract' not in _line, f)
            next(dropped, '')  # skip the line that contains 'Abstract'
            for line in dropped:
                print(line)
You can use itertools.dropwhile and itertools.islice here; a pseudo-example:
    from itertools import dropwhile, islice
    for fname in filepaths:
        with open(fname) as fin:
            # L.split() matches 'Abstract' only as a whole whitespace-separated word
            start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
            for line in islice(start_at, 1, None):  # ignore the line still with Abstract in
                print(line)
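The dropwhile/islice combination can be wrapped in a small helper so it works on any iterable of lines. This is only a sketch: the name lines_after and the sample data are made up for illustration, and it uses a plain substring test rather than the split() check.

```python
from itertools import dropwhile, islice

def lines_after(lines, marker='Abstract'):
    """Return an iterator over the lines following the first line that contains marker."""
    # dropwhile discards lines until one contains the marker...
    from_marker = dropwhile(lambda line: marker not in line, lines)
    # ...and islice(…, 1, None) skips that marker line itself.
    return islice(from_marker, 1, None)

sample = ["Title\n", "Abstract\n", "body 1\n", "body 2\n"]  # hypothetical file content
print(list(lines_after(sample)))              # ['body 1\n', 'body 2\n']
print(list(lines_after(["no marker here\n"])))  # []
```

Because an open file is an iterable of lines, lines_after(f) can be called on it directly inside a with block.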
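Taking a guess at how the dictionary is involved, I'd write it this way:
    lines = dict()
    for filename in filepath:
        with open(filename, 'r') as f:
            for line in f:
                if 'Abstract' in line:
                    break
            lines[filename] = tuple(f)
So for each file, your dictionary contains a tuple of lines.
This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f.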
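For trying this out end to end, here is a self-contained sketch of the same idea. The function name read_after_marker, the file name paper.txt, and its content are all assumptions made for the demo; a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

def read_after_marker(paths, marker='Abstract'):
    """Map each path to a tuple of the lines following the first line containing marker."""
    result = {}
    for path in paths:
        with open(path, 'r') as f:
            for line in f:  # consume lines up to and including the marker line
                if marker in line:
                    break
            result[path] = tuple(f)  # whatever is left in the file
    return result

# Demo on a throwaway file with invented content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'paper.txt')
    with open(path, 'w') as out:
        out.write('Title\nAbstract\nline one\nline two\n')
    lines = read_after_marker([path])
    print(lines[path])  # ('line one\n', 'line two\n')
```

A file that never contains the marker ends up with an empty tuple, since the first loop has already consumed every line.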
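To me, the following code is easier to understand.
    with open(file_name, 'r') as f:
        # Advance past the line containing 'Abstract'.
        # next(f) raises StopIteration if the file has no such line.
        while 'Abstract' not in next(f):
            pass
        for line in f:
            pass  # line is now the next line after the one that contains 'Abstract'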
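With thousands of files, some may not contain the trigger at all, in which case a bare next(f) raises StopIteration. One possible guard, sketched here with io.StringIO standing in for a file and made-up content, is to give next a default value:

```python
import io

# Sample "file" that deliberately lacks the trigger string.
f = io.StringIO("Title\nno marker in this file\n")

line = next(f, None)  # None signals that the file is exhausted
while line is not None and 'Abstract' not in line:
    line = next(f, None)

found = line is not None
rest = list(f)  # lines after the trigger; empty when it was never found
print(found, rest)  # False []
```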