Remove specific line breaks in Python
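I have a very long FASTA file and I need to reformat its lines. I have tried many things, but since I'm not very familiar with Python I can't quite get it right.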
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I want them to look like this:
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I tried this:
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)
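But the output does not show the ">" lines, and all the remaining lines are merged into a single string. I hope you can help me. Thanks.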
Since you are working with FASTA data, another solution is to use a dedicated library, in which case all you need is a one-liner:
import sys
from Bio import SeqIO
SeqIO.write(SeqIO.parse('file.fasta', 'fasta'), sys.stdout, 'fasta-2line')
The 'fasta-2line' format name tells SeqIO.write to omit line breaks within sequences.
I have fixed some errors in your code.
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
needed_lines = []
for line in a_file:
    if line.strip().startswith(">") or line.strip() == "":
        # If any sequence lines were accumulated before, commit them.
        if string_without_line_breaks != "":
            needed_lines.append(string_without_line_breaks)
            string_without_line_breaks = ""
        # Drop the trailing newline; join() adds it back below.
        needed_lines.append(line.rstrip())
        continue
    else:
        stripped_line = line.strip()
        string_without_line_breaks += stripped_line
a_file.close()
# Commit the last sequence, which has no header line after it.
if string_without_line_breaks != "":
    needed_lines.append(string_without_line_breaks)
print("\n".join(needed_lines))
A common arrangement is to strip the newline, then add it back when you see the next record.
# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()
A common beginner mistake is to forget to add the final newline after the loop ends.
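This just prints one line at a time and returns nothing. A better design might be to collect and yield one FASTA record (header + sequence) at a time, perhaps as an object, and let the caller decide what to do with it. That said, you probably want to use an existing library for this; BioPython seems to be the go-to solution in bioinformatics.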
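The record-at-a-time design mentioned above could look roughly like the following. This is only a sketch: the function name read_fasta, the tuple layout, and the sample data are made up for illustration, and it reads from an in-memory io.StringIO so it can be run without a file.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file-like object.

    Hypothetical helper: lines before the first header are ignored.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line:
            # Sequence line: collect it without its line break.
            chunks.append(line)
    # The last record has no header after it, so emit it here.
    if header is not None:
        yield header, "".join(chunks)

# The caller decides what to do with each record; here it just prints them.
sample = io.StringIO(">seq1\nAAAA\nCCCC\n>seq2\nGGGG\nTT\n")
records = list(read_fasta(sample))
for header, sequence in records:
    print(header)
    print(sequence)
```

With a real file, the same generator would be fed the handle from a with open("file.fasta") block instead of the StringIO object.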
Make sure the lines containing the right angle bracket (>) are added to your string as well.
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        # Start each header on a new line, without a leading blank line for the first one.
        if string_without_line_breaks:
            string_without_line_breaks += "\n"
        string_without_line_breaks += line
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)
By the way, you can turn it into a one-liner:
import re
with open("file.fasta", 'r') as f:
    data = f.read()
# Replace with \1 so the line's text is kept and only its newline is removed.
result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)
print(result)
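The regex uses negative lookaheads to avoid trimming lines that start with >, and to avoid trimming the newline right before a line that starts with >.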
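As a quick check of how the two lookaheads behave, here is a minimal sketch run against a small in-memory string instead of a real file. The sample data is made up for illustration, and the replacement is written as \1 so that the text of each matched line is kept and only its line break is dropped.

```python
import re

# Hypothetical in-memory sample standing in for the contents of file.fasta.
data = ">seq1\nAAAA\nCCCC\nGG\n>seq2\nTTTT\nAA\n"

# Join a sequence line to the following line unless either one starts with ">".
result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)
print(result)
```

Note that the newline after the very last sequence line is also removed, because the end of the text is not a ">" either; print() puts one back.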
First, the usual disclaimer: operate on files inside a with block whenever possible. Otherwise they will not be closed if an error occurs.
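To illustrate the point about errors, here is a small sketch that raises an exception inside a with block and then checks that the file was closed anyway. The scratch file and the simulated failure are assumptions made only so the example is self-contained.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not depend on file.fasta.
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "w") as f:
    f.write(">seq1\nAAAA\n")

try:
    with open(path, "r") as a_file:
        first_line = a_file.readline()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

# The with block closed the file even though an exception was raised.
print(a_file.closed)
```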
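Observe that you want to remove the newline from every line that does not start with >, except the last one in each block. You can get the same effect by stripping the newline from every line that does not start with >, and writing a newline in front of every line that starts with >, except the first.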
import sys
out = sys.stdout
with open(..., 'r') as file:
    first = True
    hasline = False
    for line in file:
        if line.startswith('>'):
            # Terminate the previous sequence before every header but the first
            if not first:
                out.write('\n')
            out.write(line)
            first = False
        else:
            out.write(line.rstrip())
            hasline = True
    # Terminate the last sequence
    if hasline:
        out.write('\n')
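In this case, printing as you go is much simpler than accumulating a string. When you are just transcribing lines, writing to the file with its write method is simpler than using print.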
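One benefit of going through write is that the same code works with any file-like object. The sketch below wraps the idea in a function and tries it on in-memory text; the name unwrap_fasta, the pending flag, and the sample data are assumptions made for illustration, not part of the code above.

```python
import io
import sys

def unwrap_fasta(src, out=sys.stdout):
    """Copy FASTA text from src to out with each sequence on a single line."""
    pending = False  # True while a written sequence still lacks its newline
    for line in src:
        if line.startswith(">"):
            if pending:
                out.write("\n")
                pending = False
            out.write(line)
        else:
            out.write(line.rstrip())
            pending = True
    # Terminate the last sequence.
    if pending:
        out.write("\n")

# Try it on in-memory text; any object with a write method works as out.
buffer = io.StringIO()
unwrap_fasta(io.StringIO(">seq1\nAAAA\nCCCC\n>seq2\nGGGG\nTT\n"), buffer)
print(buffer.getvalue(), end="")
```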