Creating a new .txt with lines from another .txt

I have a document with this structure (it is large, more than 20,000 lines):

@A00627:308:H227VDSX3:1:1201:30734:26349 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFF:F:FFFFFFFFFFFF
@A00627:308:H227VDSX3:1:1257:18828:34695 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF:FFFFFFFF,FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00627:308:H227VDSX3:1:1266:28809:10300 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGAAACCCACTGGGTGCCCG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFF,FFFFF:,F:FFFFFFF
@A00627:308:H227VDSX3:1:1447:29315:13745 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT
+

I want to keep only the lines that start with @ and the line that follows each of them, like this:

    @A00627:308:H227VDSX3:1:1201:30734:26349 2:N:0:TGGCAGTA+GTACAGTG
    CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT
   
    
    @A00627:308:H227VDSX3:1:1257:18828:34695 2:N:0:TGGCAGTA+GTACAGTG
    CTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGA
    
    
    @A00627:308:H227VDSX3:1:1266:28809:10300 2:N:0:TGGCAGTA+GTACAGTG
    CTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGAAACCCACTGGGTGCCCG
    
    
    @A00627:308:H227VDSX3:1:1447:29315:13745 2:N:0:TGGCAGTA+GTACAGTG
    CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT

I tried this code:

import fileinput
from collections import deque
output_file = 'cola1_fasta.txt' 
buscado = '@'

contexto = deque([], 3)  # keeps only the last 3 lines read


with open(output_file, "w") as f_out:
    for line in fileinput.input(files=["cola1.txt"]):
        contexto.append(line)       
        if len(contexto) < 3:      
            continue
        if buscado in contexto[1]:  
            f_out.writelines(contexto) 

But I can't get the output I want. Do you have any suggestions? Thank you very much!

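Iterate over the input file line by line and check whether each line starts with @. If it does, write the line to the output file and set the header_row flag to True, so that on the next iteration we know to write the following line as well. Every other line is replaced by an empty line.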

input_filename = r"cola1.txt"
output_filename = r"cola1_fasta.txt"

header_row = False
with open(input_filename) as in_f:
    with open(output_filename, "wt") as out_f:
        for line in in_f:
            if line.startswith("@"):
                out_f.write(line)
                header_row = True
            elif header_row:
                out_f.write(line)
                header_row = False
            else:
                out_f.write("\n")

cola1_fasta.txt:

@A00627:308:H227VDSX3:1:1201:30734:26349 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT


@A00627:308:H227VDSX3:1:1257:18828:34695 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGA


@A00627:308:H227VDSX3:1:1266:28809:10300 2:N:0:TGGCAGTA+GTACAGTG
CTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATTAAGAGAAGAGAAGAAACGCCCACGCCAGGAAACCCACTGGGTGCCCG


@A00627:308:H227VDSX3:1:1447:29315:13745 2:N:0:TGGCAGTA+GTACAGTG
CCCAGGAGCACCAGGAAGGGCAAGAGCACCCTGGCCTAGGGGATCATCTGGCCCAGGGTAGGGTAGGAACAGCCTCATGGTCTTCAGAGTTTGCCCCTTCCTGAGGGAAAGACATTTTAATATTTTTGGGTTGGCTGGACCAATCTCATT

Note that this implementation produces 2 extra blank lines at the bottom of the text file.
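
If the trailing blank lines are unwanted, one option is to emit the two separator lines before each record instead of after it. The following is only a minimal sketch under that idea, assuming the same layout as the sample input; the function name `keep_header_and_sequence` is hypothetical and not part of the answer above.

```python
def keep_header_and_sequence(lines):
    """Return each '@' header and the line after it.

    Two blank lines separate consecutive records, and nothing is
    appended after the last record.
    """
    out = []
    take_next = False  # True right after a header line has been seen
    for line in lines:
        if line.startswith("@"):
            if out:
                # The separator goes before every record except the first.
                out.extend(["\n", "\n"])
            out.append(line)
            take_next = True
        elif take_next:
            out.append(line)
            take_next = False
        # Any other line ('+' and the quality string) is dropped.
    return out


# Possible usage with the file names from the question:
# with open("cola1.txt") as in_f, open("cola1_fasta.txt", "w") as out_f:
#     out_f.writelines(keep_header_and_sequence(in_f))
```

Because the function accepts any iterable of lines, it can be tried on a small list before running it on the full file.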

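Take advantage of the fact that file objects are iterators in Python. Loop over the file line by line, check whether the line starts with @, and if so write that line and the next one (obtained with next) to the output file: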

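input_file = 'cola1.txt'
output_file = 'cola1_fasta.txt'

with open(output_file, 'w') as out_file, open(input_file) as in_file:
    for line in in_file:
        if line.startswith('@'):
            out_file.write(line)
            # next() consumes the following line from the same iterator;
            # the '' default avoids StopIteration if the file ends on a header
            out_file.write(next(in_file, ''))
        else:
            # every other line becomes an empty line
            out_file.write('\n')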
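
Both answers recognize a header by its leading @. The sample appears to be FASTQ-style data, where each record spans four lines (header, sequence, +, quality) and a quality line can itself begin with @, so a prefix test may misfire on some files. Below is a sketch that reads the input in groups of four lines instead. It assumes sequences are never wrapped across several lines, and `iter_records` and `header_and_sequence` are hypothetical helper names, not something from the answers.

```python
from itertools import islice


def iter_records(lines, size=4):
    """Yield consecutive groups of `size` lines as lists.

    Assumes every record occupies exactly `size` lines; a shorter
    final group (for example a truncated file) is yielded as-is.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, size))
        if not record:
            return
        yield record


def header_and_sequence(lines):
    """Return only the first two lines of every four-line record."""
    kept = []
    for record in iter_records(lines):
        kept.extend(record[:2])
    return kept


# Possible usage with the file names from the question:
# with open("cola1.txt") as in_f, open("cola1_fasta.txt", "w") as out_f:
#     out_f.writelines(header_and_sequence(in_f))
```

This version writes no blank separator lines; if the spacing from the desired output matters, the separators can be added when writing each pair.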