Reading a file block by block using a specified delimiter in Python

Question

I have an input_file.fa file like this (FASTA format):

> header1 description
data data
data
>header2 description
more data
data
data

I want to read the file one block at a time, so that each block contains one header and its corresponding data, e.g. block 1:

> header1 description
data data
data

Of course, I could read the file like this and split it:

with open("1.fa") as f:
    for block in f.read().split(">"):
        pass

But I want to avoid reading the whole file into memory, because the files are often large.

I could of course read the file line by line:

with open("input_file.fa") as f:
    for line in f:
        pass

But ideally what I want is something like this:

with open("input_file.fa", newline=">") as f:
    for block in f:
        pass

But I get an error:

ValueError: illegal newline value: >

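I also tried the csv module, but without success.

I did find this post from 3 years ago, which provides a generator-based solution to this problem, but it doesn't seem very compact. Is this really the only/best solution? It would be neat if the generator could be created with a single line rather than a separate function, something like this pseudocode: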

with open("input_file.fa") as f:
    blocks = magic_generator_split_by_>
    for block in blocks:
        pass

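If this is not possible, then I suppose you could consider my question a duplicate of the other post, but if so, I hope people can explain to me why the other solution is the only one. Many thanks.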

Answer 1

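The general solution here is to write a generator function that yields one group at a time. This way only one group is held in memory at any moment.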

def get_groups(seq, group_by):
    data = []
    for line in seq:
        # Here the `startswith()` logic can be replaced with other
        # condition(s) depending on the requirement.
        if line.startswith(group_by):
            if data:
                yield data
                data = []
        data.append(line)

    if data:
        yield data

with open('input.txt') as f:
    for i, group in enumerate(get_groups(f, ">"), start=1):
        print("Group #{}".format(i))
        print("".join(group))

Output:

Group #1
> header1 description
data data
data

Group #2
>header2 description
more data
data
data

For the FASTA format in general, I would recommend using the Biopython package.
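
If adding a dependency is not an option, the groups produced by get_groups can be turned into (header, sequence) pairs with the standard library alone. The following is only a rough sketch: iter_records is a made-up helper name, and it assumes that the data lines of a block can simply be concatenated after stripping surrounding whitespace, which may not match what a real FASTA parser does.

```python
import io


def get_groups(seq, group_by):
    """Same generator as above: yield one list of lines per block."""
    data = []
    for line in seq:
        if line.startswith(group_by):
            if data:
                yield data
                data = []
        data.append(line)
    if data:
        yield data


def iter_records(lines):
    """Hypothetical helper: yield (header, sequence) tuples, one per block."""
    for group in get_groups(lines, ">"):
        if not group[0].startswith(">"):
            continue  # skip any text that precedes the first header
        header = group[0][1:].strip()
        # Assumption: data lines are joined without separators.
        sequence = "".join(line.strip() for line in group[1:])
        yield header, sequence


# In-memory stand-in for input_file.fa, using the sample from the question.
sample = io.StringIO(
    "> header1 description\n"
    "data data\n"
    "data\n"
    ">header2 description\n"
    "more data\n"
    "data\n"
    "data\n"
)
records = list(iter_records(sample))
for header, sequence in records:
    print(header, "->", sequence)
```

Because iter_records is itself a generator, it still holds only one block in memory at a time when it is fed a real file object instead of the StringIO stand-in.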

Answer 2

def read_blocks(file):
    block = ''
    for line in file:
        if line.startswith('>') and len(block)>0:
            yield block
            block = ''
        block += line
    if block:  # avoid yielding an empty block for an empty file
        yield block


with open('input_file.fa') as f:
    for block in read_blocks(f):
        print(block)

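This reads the file line by line, and the yield statement hands the blocks back to you one at a time. It is lazy, so you don't have to worry about a large memory footprint.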
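
The line-based generator above relies on the delimiter appearing at the start of a line. If the delimiter can occur anywhere, or the file has no line breaks at all, a similar lazy generator can read fixed-size chunks instead. This is a sketch under assumptions, not a drop-in replacement: read_delimited and chunk_size are illustrative names, the delimiter is removed from the pieces (as with str.split), and empty pieces are skipped.

```python
import io


def read_delimited(file, delimiter=">", chunk_size=64 * 1024):
    """Lazily yield the pieces of `file` that lie between delimiters.

    Only `chunk_size` characters are read at a time, so memory use is
    bounded by the chunk size plus the longest single piece.
    """
    buffer = ""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; the tail may
        # still be continued by the next chunk, so it stays in the buffer.
        *complete, buffer = buffer.split(delimiter)
        for piece in complete:
            if piece:
                yield piece
    if buffer:
        yield buffer


text = (
    "> header1 description\n"
    "data data\n"
    "data\n"
    ">header2 description\n"
    "more data\n"
    "data\n"
    "data\n"
)
# A tiny chunk size is used here only to exercise the buffering logic.
blocks = list(read_delimited(io.StringIO(text), ">", chunk_size=5))
for block in blocks:
    print(repr(block))
```

Unlike the startswith-based version, this splits on every occurrence of the delimiter, so a ">" inside a data line would also start a new piece.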

Answer 3

One approach I like is to use itertools.groupby with a simple key function:

from itertools import groupby


def make_grouper():
    counter = 0
    def key(line):
        nonlocal counter
        if line.startswith('>'):
            counter += 1
        return counter
    return key

Used as:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        fasta_section = ''.join(group)   # or list(group)

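The join is only needed if you have to handle the contents of a whole section as a single string. If you are only interested in reading the lines one by one, you can simply do: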

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        # parse >header description
        header, description = next(group)[1:].split(maxsplit=1)
        for line in group:
            # handle the contents of the section line by line
            pass
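
To see why this works: groupby starts a new group whenever the key changes between consecutive items, and the key returned by make_grouper only changes on lines that start with ">". The self-contained sketch below runs the same grouper on an in-memory sample (io.StringIO stands in for the real file, and the sample contents are taken from the question with the header lines normalised).

```python
import io
from itertools import groupby


def make_grouper():
    counter = 0

    def key(line):
        nonlocal counter
        # Bump the counter on every header line, so all lines of one
        # section share the same key value.
        if line.startswith(">"):
            counter += 1
        return counter

    return key


sample = io.StringIO(
    ">header1 description\n"
    "data data\n"
    "data\n"
    ">header2 description\n"
    "more data\n"
    "data\n"
    "data\n"
)

# Each group must be consumed before advancing to the next one, which
# the join inside the comprehension takes care of.
sections = [
    (k, "".join(group)) for k, group in groupby(sample, key=make_grouper())
]
for k, section in sections:
    print(k, repr(section))
```

Any lines that appear before the first header would be grouped under key 0, which makes them easy to detect and skip.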

