Trim fasta files using BioPython

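I have a fasta file containing multiple sequences. Some of the sequences end with a run of "-" characters, and I would like to trim those from the final sequences. Is there a clean way to trim them and write a new fasta file without the dashes using Biopython?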

I saw this post, How to remove all-N sequence entries from fasta file(s), and tried to modify some of its code, but without success...

The file contains sequences like this:

sequence_of_interest CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA---------------------------------------------------------------

from Bio import SeqIO

def dash_removal(file_in, file_out):
    records = SeqIO.parse(file_in, 'fasta')
    # Keeps every record that has at least one non-dash character; nothing is trimmed here
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, file_out, 'fasta')

dash_removal("dash_removal_test.fasta", "dashes_gone?.fasta")

All sequences should end up trimmed to look like this:

sequence_of_interest CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA

Any help would be greatly appreciated!

All of the options using sed are great, since they are faster, but here is a way to do it in BioPython.

The idea is to use rstrip on the seq attribute of each record. rstrip can be used on a sequence just as on any other string in Python.
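As a quick illustration of what rstrip does and does not remove, here is a minimal sketch using plain Python strings in place of sequences. The example strings are made up for illustration; only trailing dashes are dropped, while internal and leading ones are kept.

```python
# Plain Python strings stand in for sequences; the values are invented examples.
with_internal_gap = "ACGT-AC---"
with_leading_gap = "--ACGT--"

# rstrip('-') removes only the run of dashes at the right-hand end.
trimmed_internal = with_internal_gap.rstrip("-")
trimmed_leading = with_leading_gap.rstrip("-")

# strip('-') would remove dashes from both ends, if that were wanted instead.
trimmed_both = with_leading_gap.strip("-")

print(trimmed_internal)  # ACGT-AC
print(trimmed_leading)   # --ACGT
print(trimmed_both)      # ACGT
```

If gaps inside the sequence should go as well, rstrip is not enough and something like replace('-', '') would be needed instead.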
from Bio import SeqIO
import io

seq = """>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCAT
GTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAA
TGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCA
CCAGGCCAGATGAGAGAA--------------------------------------------------------------"""

f = io.StringIO(seq) # replace it with f = open('my_fasta.fa', 'r')
clean_records = []
for record in SeqIO.parse(f, "fasta"):
    record.seq = record.seq.rstrip('-')
    clean_records.append(record)

with open('clean_fasta.fa', 'w') as f:
    SeqIO.write(clean_records, f, 'fasta')
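
For large files, the same idea can be sketched with a generator so that the records are not all collected in a list first. The helper name strip_trailing_gaps and the demo records below are hypothetical, chosen only for illustration; the sketch assumes Biopython is installed.

```python
import io
from Bio import SeqIO


def strip_trailing_gaps(records, gap="-"):
    """Yield each record with trailing gap characters removed from its sequence."""
    for record in records:
        record.seq = record.seq.rstrip(gap)
        yield record


# Small in-memory demo; for real files, pass file names to SeqIO.parse and SeqIO.write.
demo = io.StringIO(">seq1\nACGT--\n>seq2\nAC-GT\n")
out = io.StringIO()

# SeqIO.write consumes the generator one record at a time and returns the record count.
count = SeqIO.write(strip_trailing_gaps(SeqIO.parse(demo, "fasta")), out, "fasta")
result = out.getvalue()
print(result)
```

The trailing dashes of seq1 are removed, while the internal dash in seq2 is left in place.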