Extract a part from a file between specific lines

Question

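I would like to know how I can extract data within a specific range from a big data file. Is there a way to read the content that begins and ends with certain "buzzwords"?

I want to read line by line between *NODE and **: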

*NODE
13021145,       2637.6073002472617,       55.011929824413045,        206.0394346892517
13021146,       2637.6051226039867,        55.21115693303926,       206.05686503802065
13021147,        2634.226986419154,        54.98263035830583,        205.9520084547658
13021148,        2634.224808775879,       55.181857466932044,       205.96943880353476
**

There are several thousand lines before *NODE and after **...

I know it should look similar to:

a = []

with open('file.txt') as file:
   for line in file:
      if line.startswith('*NODE'):

      # NOW THERE SHOULD FOLLOW SOMETHING LIKE:
      #   Go to next line and "a.append" till there comes the "magical"
      #   "**"

Any ideas? I am new to Python. Thanks for the help! I hope you understand what I mean.

Answer 1

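You almost did it - the only thing missing is that, once you find the beginning, you keep looking for the end of the sequence and, until you reach it, append each line you iterate over to your list. I.e.: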

data = None  # a placeholder to store your lines
with open("file.txt", "r") as f:  # do not shadow the built-in `file`
    for line in f:  # iterate over the lines
        if data is None:  # we haven't found `*NODE` yet
            if line[:5] == "*NODE":  # search for `*NODE` at the line beginning
                data = []  # make `data` an empty list to begin collecting
        elif line[:2] == "**":  # data initialized, we look for the sequence's end
            break  # no need to iterate over the file anymore
        else:  # data initialized but not at the end...
            data.append(line)  # append the line to our data

Now data will contain a list of the lines between *NODE and **, or None if the sequence was not found.
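The same idea can be wrapped in a small function so that it can be tried on in-memory text before pointing it at a large file. This is only a sketch: the name `extract_block`, its `start`/`end` parameters and the sample text are made up for illustration, and it strips the trailing newline from each line, which the loop above does not do.

```python
import io


def extract_block(lines, start="*NODE", end="**"):
    """Return the lines between `start` and `end`, or None if `start` never appears."""
    data = None  # stays None until the start marker is seen
    for line in lines:
        if data is None:
            if line.startswith(start):
                data = []  # marker found, begin collecting
        elif line.startswith(end):
            break  # end marker reached, stop reading
        else:
            data.append(line.rstrip("\n"))
    return data


# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "header line\n"
    "*NODE\n"
    "1, 0.0, 1.5, 2.5\n"
    "2, 3.0, 4.5, 5.5\n"
    "**\n"
    "trailing line\n"
)
block = extract_block(sample)
missing = extract_block(io.StringIO("nothing here\n"))
```

With a real file, `with open("file.txt") as f: block = extract_block(f)` should behave the same way, since the function only needs an iterable of lines.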

Answer 2

Try this:

with open('file.txt') as file:
    a = []
    running = False  # avoid NameError when 'if' statement below isn't reached
    for line in file:
        if line.startswith('*NODE'):
            running = True  # show that we are starting to add values
            continue  # make sure we don't add '*NODE'
        if line.startswith('**'):
            running = False  # show that we're done adding values
            continue  # make sure we don't add '**'
        if running:  # only add the values if 'running' is True
            a.extend([i.strip() for i in line.split(',')])

The output is a list containing the following strings (I used print('\n'.join(a))):

13021145
2637.6073002472617
55.011929824413045
206.0394346892517
13021146
2637.6051226039867
55.21115693303926
206.05686503802065
13021147
2634.226986419154
54.98263035830583
205.9520084547658
13021148
2634.224808775879
55.181857466932044
205.96943880353476
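If numbers are more useful than a flat list of strings, the same flag-based loop can convert each row while reading. The sketch below assumes every data row has exactly four comma-separated fields (an integer node id and three coordinates), as in the sample from the question; `parse_nodes` and the sample text are hypothetical names and data.

```python
import io


def parse_nodes(lines, start="*NODE", end="**"):
    """Collect (node_id, x, y, z) tuples from rows between `start` and `end`."""
    nodes = []
    running = False  # True only while we are inside the block
    for line in lines:
        if line.startswith(start):
            running = True
            continue  # do not parse the marker itself
        if line.startswith(end):
            running = False
            continue
        if running and line.strip():  # skip blank lines inside the block
            # Assumption: exactly four fields per row.
            node_id, x, y, z = (field.strip() for field in line.split(","))
            nodes.append((int(node_id), float(x), float(y), float(z)))
    return nodes


# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "** a comment before the block\n"
    "*NODE\n"
    "1, 0.0, 1.5, 2.5\n"
    "2, 3.0, 4.5, 5.5\n"
    "**\n"
    "footer\n"
)
nodes = parse_nodes(sample)
```

A row with a different number of fields would raise a ValueError at the unpacking step, which is probably what you want for malformed input.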

Answer 3

We can iterate over the lines until there are none left or until we have reached the end of the block, like

a = []

with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # collect block-related lines
            while True:
                try:
                    line = next(file)
                except StopIteration:
                    # there are no lines left
                    break
                if line.startswith('**'):
                    # we've reached the end of block
                    break
                a.append(line)
            # stop iterating over file
            break

which gives us

print(a)
['13021145,       2637.6073002472617,       55.011929824413045,        206.0394346892517\n',
 '13021146,       2637.6051226039867,        55.21115693303926,       206.05686503802065\n',
 '13021147,        2634.226986419154,        54.98263035830583,        205.9520084547658\n',
 '13021148,        2634.224808775879,       55.181857466932044,       205.96943880353476\n']
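The explicit next() call with try/except can also be expressed as a nested for loop over the same iterator, because the inner loop simply continues from wherever the outer loop stopped. This is a sketch of that variant, not a replacement for the code above; `collect_block` and the sample text are illustrative names and data.

```python
import io


def collect_block(lines, start="*NODE", end="**"):
    """Return the lines after the first `start` marker up to the next `end` marker."""
    lines = iter(lines)  # make sure both loops share one iterator
    collected = []
    for line in lines:
        if line.startswith(start):
            for line in lines:  # continues from the line after the marker
                if line.startswith(end):
                    break  # end of block
                collected.append(line)
            break  # stop iterating over the remaining lines
    return collected


# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "header\n"
    "*NODE\n"
    "1, 0.0, 1.5, 2.5\n"
    "2, 3.0, 4.5, 5.5\n"
    "**\n"
    "footer\n"
)
a = collect_block(sample)
```

If the file ends before `**` is seen, the inner loop just runs out of lines, so no StopIteration handling is needed.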

Or we can write helper predicates like

def not_a_block_start(line):
    return not line.startswith('*NODE')


def not_a_block_end(line):
    return not line.startswith('**')

and then use the brilliance of the itertools module like

from itertools import (dropwhile,
                       takewhile)    

with open('file.txt') as file:
    block_start = dropwhile(not_a_block_start, file)
    # skip the block start line; the default avoids StopIteration if '*NODE' is absent
    next(block_start, None)
    a = list(takewhile(not_a_block_end, block_start))

This gives us the same value of a.
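For experimenting without a file on disk, the dropwhile/takewhile approach can be packaged as a function and fed an in-memory buffer. This is a minimal sketch: `block_between`, its parameters and the sample text are invented for illustration, and it passes a default to next() so that input without a start marker yields an empty list.

```python
import io
from itertools import dropwhile, takewhile


def block_between(lines, start="*NODE", end="**"):
    """Return the lines strictly between the first `start` and the following `end`."""
    # Drop everything before the start marker.
    from_start = dropwhile(lambda line: not line.startswith(start), lines)
    # Discard the marker itself; a no-op if the marker was never found.
    next(from_start, None)
    # Take lines until the end marker shows up (or the input runs out).
    return list(takewhile(lambda line: not line.startswith(end), from_start))


# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "header\n"
    "*NODE\n"
    "1, 0.0, 1.5, 2.5\n"
    "2, 3.0, 4.5, 5.5\n"
    "**\n"
    "footer\n"
)
a = block_between(sample)
empty = block_between(io.StringIO("no markers here\n"))
```

Both itertools functions are lazy, so the file is read only up to the `**` line rather than loaded into memory as a whole.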
