Printing the frequency of sequences in a fasta file (python)

Question

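I am trying to measure the diversity of a population of sequences in a large fasta file. The end goal is to create a histogram of the distribution.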

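I wrote the code below to count how many times each sequence occurs in the fasta file. It appends the count to the end of the id. Instead of this format, I would like to print an output file that simply states that 1 sequence occurs x times, y sequences occur z times, and so on, with no sequences or ids.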

from Bio import SeqIO
from collections import defaultdict

dedup_records = defaultdict(list)
for record in SeqIO.parse("filename.fasta", "fasta"):
    # Use the sequence as the key and then have a list of id's as the value
    dedup_records[str(record.seq)].append(record.id)
with open("filename_output.fasta", 'w') as output:
    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        output.write(">{}_counts{}\n".format(ids[0], len(ids)))
        output.write(seq + "\n")

The image shows a snippet of the output file

From this image, I would like the printed output to be: 1 sequence occurs 1885 times, 1 sequence occurs 1099 times, 1 sequence occurs 280 times.

Also, when multiple sequences occur the same number of times, each one is printed separately. I am not sure how to combine them.

Please let me know if you have any suggestions. Thank you very much.

Answer 1

collections.Counter()

Use a Counter, twice. Like this:

from Bio import SeqIO
from collections import Counter

# counts the number of times each sequence occurs
sequences = SeqIO.parse("filename.fasta", "fasta")
seq_counts = Counter(str(record.seq) for record in sequences)

# counts how many repeat 2, 3, 4, ... times
count_repeats = Counter(seq_counts.values())

with open("filename_output.fasta", 'w') as output:
    for repeat, num_seqs in count_repeats.most_common():
        output.write(f">{num_seqs} sequences occur {repeat} times\n")
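
To check the two-Counter idea on a small input, here is a minimal sketch that runs without Biopython. The helper names `read_fasta_sequences` and `frequency_of_frequencies` and the toy records are made up for illustration. The small standard-library parser only stands in for `SeqIO.parse` and assumes a well-formed file. Lines are sorted by occurrence count rather than with `most_common()`, which is a swapped-in ordering to match the output the question asks for.

```python
from collections import Counter
from io import StringIO


def read_fasta_sequences(handle):
    """Yield each sequence in a FASTA stream as one string (stdlib only)."""
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def frequency_of_frequencies(sequences):
    """Map each occurrence count to the number of distinct sequences with that count."""
    seq_counts = Counter(sequences)        # sequence -> times it occurs
    return Counter(seq_counts.values())    # times it occurs -> number of sequences


# Made-up toy data, for illustration only.
toy_fasta = StringIO(
    ">r1\nACGT\n>r2\nACGT\n>r3\nACGT\n>r4\nTTGA\n>r5\nTTGA\n>r6\nGGCC\n>r7\nCCAA\n"
)

count_repeats = frequency_of_frequencies(read_fasta_sequences(toy_fasta))

# Sort by occurrence count, highest first; sequences tied on count share one line.
report_lines = [
    f"{num_seqs} sequence(s) occur {repeat} time(s)"
    for repeat, num_seqs in sorted(count_repeats.items(), reverse=True)
]
print("\n".join(report_lines))
```

Because `count_repeats` maps an occurrence count to the number of distinct sequences with that count, its keys and values are the x and y data a bar chart of the distribution would need.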
