How to only read lines in a text file after a certain string?

Question

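I'd like to read all lines that come after a particular string in a text file into a dictionary. I'd like to do this over thousands of text files.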

I'm able to identify and print out the particular string ('Abstract') using the following code (taken from this answer):

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                print(line)

But how do I tell Python to start reading only the lines that come after the string?

Answer 1

Use a boolean to ignore the lines up to that point:

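for files in filepath:
    # Reset the flag for each file so a match in one file doesn't carry over to the next.
    found_abstract = False
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                found_abstract = True
            if found_abstract:
                # do whatever you want (this also sees the 'Abstract' line itself)
                print(line)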

Answer 2

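Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can set a boolean flag that indicates whether lines should be ignored, and check it on every line.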

pay_attention = False
for line in f:
    if pay_attention:
        print(line)
    else:  # We haven't found our trigger yet; see if it's in this line
        if 'Abstract' in line:
            pay_attention = True
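If you want to reuse the flag approach across many files, one option is to wrap it in a small generator. This is only a sketch: the name lines_after and the io.StringIO sample text are made up for illustration, and it assumes the trigger line itself should be skipped.

```python
import io

def lines_after(lines, marker):
    """Yield the lines that come after the first line containing marker."""
    pay_attention = False
    for line in lines:
        if pay_attention:
            yield line
        elif marker in line:
            # Trigger found; start yielding from the next line on.
            pay_attention = True

# Hypothetical sample text standing in for a real file.
sample = io.StringIO("Title\nAbstract\nfirst\nsecond\n")
result = list(lines_after(sample, 'Abstract'))
print(result)  # ['first\n', 'second\n']
```

Because it accepts any iterable of lines, the same function should work on a file object returned by open.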

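If you don't mind rearranging your code a little more, you can use two partial loops instead: one loop that terminates once you've found the trigger phrase ('Abstract'), and another that reads all the following lines. This approach is a little cleaner (and very slightly faster, since no flag is checked on every line).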

for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
    if 'Abstract' in skippable_line:
        break
for line in f:  # The file's iterator starts up again right where we left it.
    print(line)

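This works because the file object returned by open behaves like a generator rather than a list: it only produces values as they are requested. So when the first loop stops, the file's internal position is at the start of the first "unread" line. This means that when you enter the second loop, the first line you see is the first line after the one that triggered the break.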
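To see the resume behaviour without touching the disk, here is a minimal sketch that uses io.StringIO as a stand-in for a real file (the sample text is made up). It also checks that iter(f) returns f itself, which is why the second loop does not start over.

```python
import io

# Hypothetical in-memory "file"; io.StringIO iterates line by line like a file object.
f = io.StringIO("header\nAbstract\nline 1\nline 2\n")

for skippable_line in f:
    if 'Abstract' in skippable_line:
        break

# iter(f) returns f itself, so a second loop resumes instead of restarting.
same_iterator = iter(f) is f
remaining = [line.rstrip('\n') for line in f]
print(same_iterator, remaining)  # True ['line 1', 'line 2']
```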

Answer 3

When you reach the line you want to start from, start another loop:

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                for line in f:  # now you are at the lines you want
                    # do work
                    print(line)

A file object is its own iterator, so when we reach the line with 'Abstract' in it, the inner loop continues from that point until the iterator is consumed.

A simple example:

gen = (n for n in range(8))

for x in gen:
    if x == 3:
        print('Starting second loop')
        for x in gen:
            print('In second loop', x)
    else:
        print('In first loop', x)

Output:

In first loop 0
In first loop 1
In first loop 2
Starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7

You can also use itertools.dropwhile to consume the lines up to the point you want:

from itertools import dropwhile

for files in filepath:
    with open(files, 'r') as f:
        dropped = dropwhile(lambda _line: 'Abstract' not in _line, f)
        next(dropped, '')  # skip the line that contains 'Abstract'
        for line in dropped:
            print(line)

Answer 4

You can use itertools.dropwhile and itertools.islice here. A pseudo-example:

from itertools import dropwhile, islice

for fname in filepaths:
    with open(fname) as fin:
        # L.split() matches 'Abstract' only as a whole whitespace-separated word
        start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
        for line in islice(start_at, 1, None):  # skip the line that still contains 'Abstract'
            print(line)
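Note that testing against L.split() is not the same as the plain substring test used in the other answers. A small sketch of the difference, on made-up sample text where the first line only contains 'Abstract' as part of a longer word:

```python
from itertools import dropwhile, islice
import io

# Hypothetical sample: 'Abstraction' contains 'Abstract' as a substring only.
text = "Abstraction layers\nAbstract\nbody\n"

by_substring = list(islice(
    dropwhile(lambda l: 'Abstract' not in l, io.StringIO(text)), 1, None))
by_token = list(islice(
    dropwhile(lambda l: 'Abstract' not in l.split(), io.StringIO(text)), 1, None))

print(by_substring)  # ['Abstract\n', 'body\n']
print(by_token)      # ['body\n']
```

The token test is stricter, so it would also miss a heading written as 'Abstract:' with trailing punctuation. Which one fits depends on how the heading appears in your files.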

Answer 5

Taking a guess at how the dictionary is involved, I'd write it this way:

lines = dict()
for filename in filepath:
    with open(filename, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                break
        lines[filename] = tuple(f)

So for each file, your dictionary contains a tuple of lines.

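This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f.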
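A self-contained version of the same idea that you can run as-is might look like the sketch below. The function name lines_after_marker and the two temporary demo files are made up for illustration. It also shows what happens when a file has no 'Abstract' line: the first loop exhausts f, so that file maps to an empty tuple.

```python
import os
import tempfile

def lines_after_marker(paths, marker='Abstract'):
    """Map each path to the tuple of lines following its first marker line."""
    result = {}
    for path in paths:
        with open(path, 'r') as f:
            for line in f:
                if marker in line:
                    break
            # If the marker never appears, f is exhausted and the tuple is empty.
            result[path] = tuple(f)
    return result

# Hypothetical demo files created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with_marker = os.path.join(tmp, 'a.txt')
    without_marker = os.path.join(tmp, 'b.txt')
    with open(with_marker, 'w') as out:
        out.write("Title\nAbstract\nbody 1\nbody 2\n")
    with open(without_marker, 'w') as out:
        out.write("no marker here\n")
    found = lines_after_marker([with_marker, without_marker])
    summary = [found[with_marker], found[without_marker]]

print(summary)  # [('body 1\n', 'body 2\n'), ()]
```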

Answer 6

To me, the following code is easier to understand.

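with open(file_name, 'r') as f:
    # Skip lines up to and including the first one that contains 'Abstract'.
    # Note: next(f) raises StopIteration if no line contains 'Abstract'.
    while 'Abstract' not in next(f):
        pass
    for line in f:
        # line is now a line after the one that contains 'Abstract'
        print(line)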
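With thousands of files, some may have no 'Abstract' line at all, in which case a bare next(f) raises StopIteration. One way to guard against that is to pass a default to next. This is a sketch only: the helper name skip_past and the in-memory sample files are hypothetical.

```python
import io

def skip_past(f, marker):
    """Advance f past the first line containing marker; return False if it is absent."""
    sentinel = object()
    while True:
        line = next(f, sentinel)  # the default avoids StopIteration at end of file
        if line is sentinel:
            return False
        if marker in line:
            return True

# Hypothetical in-memory files for illustration.
present = io.StringIO("intro\nAbstract\nrest\n")
absent = io.StringIO("intro\nrest\n")

found_present = skip_past(present, 'Abstract')
after = list(present)
found_absent = skip_past(absent, 'Abstract')
print(found_present, after, found_absent)  # True ['rest\n'] False
```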
