Remove specific line breaks in Python

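I have a very long FASTA file and need to reformat its lines. I have tried many things, but since I am not very familiar with Python I could not work out a solution.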

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I would like them to look like this:

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I tried this:

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)

But the result omits the ">" lines and merges all the remaining lines into a single one. I hope you can help me. Thanks.

Since you are working with FASTA data, another solution is to use a dedicated library, in which case what you need is a one-liner:

import sys

from Bio import SeqIO

SeqIO.write(SeqIO.parse('file.fasta', 'fasta'), sys.stdout, 'fasta-2line')

The 'fasta-2line' format name tells SeqIO.write to omit the line breaks inside each sequence.

I have fixed some mistakes in your code.

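a_file = open("file.fasta", "r")
string_without_line_breaks = ""
needed_lines = []
for line in a_file:
    stripped_line = line.strip()
    if stripped_line.startswith(">") or stripped_line == "":
        # If any sequence lines were accumulated before, commit them first.
        if string_without_line_breaks != "":
            needed_lines.append(string_without_line_breaks)
            string_without_line_breaks = ""
        # Keep header lines without their newline; skip blank lines.
        if stripped_line != "":
            needed_lines.append(stripped_line)
    else:
        string_without_line_breaks += stripped_line
a_file.close()
# Commit the last sequence, which is not followed by another header.
if string_without_line_breaks != "":
    needed_lines.append(string_without_line_breaks)
print("\n".join(needed_lines))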

A common arrangement is to strip the newline, then add it back when you see the next record.

# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()

A common beginner mistake is forgetting to add the final newline after the loop ends.

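This just prints one line at a time and returns nothing. A better design might be to collect and yield one FASTA record (header + sequence) at a time, perhaps as an object, and let the caller decide what to do with it. However, you would probably want to use an existing library for that; BioPython seems to be the go-to solution for bioinformatics.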
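To illustrate the "yield one record at a time" design mentioned above, here is a minimal sketch using only the standard library. The function name `read_fasta` and the choice of plain `(header, sequence)` tuples are my own assumptions for illustration, not part of any library; a real project would more likely rely on BioPython.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines.

    Hypothetical helper for illustration only.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None:
            # Sequence text before the first header is ignored.
            chunks.append(line)
    # The last record is not followed by another header.
    if header is not None:
        yield header, "".join(chunks)


# Small in-memory demonstration; an open file object works the same way.
sample = [">seq1\n", "AAAA\n", "CCCC\n", ">seq2\n", "GGGG\n", "TT\n"]
for header, sequence in read_fasta(sample):
    print(header)
    print(sequence)
```

With this shape the caller decides what to do with each record (print it, filter it, write it elsewhere), and only one record is held in memory at a time.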

Make sure the lines containing the right angle bracket (>) are added to your string as well.

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        string_without_line_breaks += "\n" + line
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
# The first header was prefixed with "\n", so drop that leading blank line.
print(string_without_line_breaks.lstrip("\n"))

By the way, you can turn this into a one-liner:

import re

with open("file.fasta", 'r') as f:
    data = f.read()

# Keep the captured line text (\1) and drop only the newline that follows it.
result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)

print(result)

The regex contains negative lookaheads that prevent trimming lines that start with > and prevent trimming the line immediately before a >.
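To make the two lookaheads easier to follow, here is a small sketch that runs the substitution on an in-memory string instead of a file. The sample data is made up for illustration, and the replacement is `\1` so that the matched line text is kept and only the newline after it is dropped.

```python
import re

# Hypothetical sample data standing in for the contents of file.fasta.
data = ">seq1\nAAAA\nCCCC\nGG\n>seq2\nTTTT\nAA\n"

# ^(?!>)  : the line must not start with ">" (header lines are left alone)
# (.*)$   : capture the text of the line
# \n(?!>) : match the newline only if the next line is not a header
unwrapped = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)

# The newline at the very end of the input is also removed, because the end
# of the string is not followed by ">"; print() adds one back.
print(unwrapped)
```

Note that this approach reads the whole file into memory, which may matter for a very long FASTA file.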

First, the usual disclaimer: operate on files in a with block whenever possible. Otherwise they will not be closed if an error occurs.

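Observe that you want to remove the newline from every line that does not start with >, except the last one in each block. You can achieve the same effect by stripping the newline after every line that does not start with >, and prepending a newline to every line that starts with > except the first.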

import sys

out = sys.stdout
with open(..., 'r') as file:
    first = True
    hasline = False
    for line in file:
        if line.startswith('>'):
            if not first:
                out.write('\n')
            out.write(line)
            first = False
        else:
            out.write(line.rstrip())
            hasline = True
    if hasline:
        out.write('\n')

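In this case, printing as you go is much simpler than accumulating a string. When you are just transcribing lines, writing to the file with its write method is simpler than using print.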
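If you want the result in another file rather than on the terminal, the same write-as-you-go idea can be wrapped in a function that takes any two file objects. This is only a sketch: the name `unwrap_fasta` and its parameters are assumptions of mine, and the demonstration uses `io.StringIO` in place of real files.

```python
import io
import sys


def unwrap_fasta(src, dst):
    """Copy FASTA text from src to dst, joining each record's sequence lines.

    Hypothetical helper for illustration only.
    """
    pending = False  # True when sequence text was written without a newline
    for line in src:
        if line.startswith(">"):
            if pending:
                dst.write("\n")  # finish the previous sequence line
                pending = False
            dst.write(line.rstrip("\n") + "\n")
        else:
            text = line.strip()
            if text:  # ignore blank lines
                dst.write(text)
                pending = True
    if pending:
        dst.write("\n")  # final newline after the last sequence


# In-memory demonstration; with real files, pass open file objects instead.
source = io.StringIO(">seq1\nAAAA\nCCCC\n>seq2\nGGGG\nTT\n")
unwrap_fasta(source, sys.stdout)
```

Because the function only calls `write` on its second argument, the same code can target `sys.stdout`, a file opened for writing inside a with block, or an in-memory buffer.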