Printing the frequency of sequences in a fasta file (python)
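I am trying to measure the diversity of a population of sequences in a large fasta file. The end goal is to create a histogram of the distribution.
I wrote the code below to count how many times each sequence occurs in the fasta file. It appends the count to the end of the id. Instead of this format, I would like to write an output file that simply says 1 sequence occurs x times, y sequences occur z times, and so on, without the sequences and ids.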
from Bio import SeqIO
from collections import defaultdict

dedup_records = defaultdict(list)
for record in SeqIO.parse("filename.fasta", "fasta"):
    # Use the sequence as the key and keep a list of ids as the value
    dedup_records[str(record.seq)].append(record.id)

with open("filename_output.fasta", 'w') as output:
    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        output.write(">{}_counts{}\n".format(ids[0], len(ids)))
        output.write(seq + "\n")
The image shows a snippet of the output file
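From this image I would like to print the output:
1 sequence occurs 1885 times
1 sequence occurs 1099 times
1 sequence occurs 280 times.
Also, when several sequences occur the same number of times, they are each printed separately. I am not sure how to combine these. EX
If you have any suggestions, please let me know. Thank you very much.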
collections.Counter()
Use a Counter, twice. Like this:
from Bio import SeqIO
from collections import Counter

# counts the number of times each sequence occurs
sequences = SeqIO.parse("filename.fasta", "fasta")
seq_counts = Counter(str(record.seq) for record in sequences)

# counts how many repeat 2, 3, 4, ... times
count_repeats = Counter(seq_counts.values())

with open("filename_output.fasta", 'w') as output:
    for repeat, num_seqs in count_repeats.most_common():
        output.write(f">{num_seqs} sequences occur {repeat} times\n")
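To see what the second Counter does without needing a fasta file, here is a minimal sketch on a toy list. The helper name `count_of_counts` and the toy sequences are made up for illustration. In the real script the strings would come from `SeqIO.parse` as above.

```python
from collections import Counter


def count_of_counts(sequences):
    """Map "times a sequence occurs" -> "number of distinct sequences with that count"."""
    # First pass: how many times each distinct sequence occurs
    seq_counts = Counter(sequences)
    # Second pass: how many distinct sequences share each occurrence count
    return Counter(seq_counts.values())


# Hypothetical toy data standing in for the sequences read from the fasta file
toy = ["ACGT"] * 3 + ["TTTT"] * 3 + ["GGCC"] * 2 + ["AAAA"]
dist = count_of_counts(toy)

# Print from the most repeated sequences down to the singletons
for times, num_seqs in sorted(dist.items(), reverse=True):
    print(f"{num_seqs} sequence(s) occur {times} time(s)")
```

Sequences that occur the same number of times end up under a single key, which is the grouping asked about in the question. Note that `most_common()` in the answer orders the lines by how many sequences share a count, while `sorted(dist.items(), reverse=True)` here orders them by the count itself. Either works, and the sorted form may be the more convenient input for a histogram.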