How to get the sequence counts (in fasta) with conditions using python?

Question

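I have a fasta file (a fasta file is one in which each header line starts with > and is followed by the sequence line belonging to that header). I want to get the count of the sequences matching TRINITY, and the total number of sequences starting with >K that follow each >TRINITY sequence. I was able to get the counts of the >TRINITY sequences, but I am not sure how to get the counts of >K for the corresponding >TRINITY sequence group. How can I get this done in python?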

myfasta.fasta:

>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>K00363:119:HTJ23BBXX:1:2228:9678:46223_(reversed) 1:N:0:CGATGTAT
TAGATTTAAAATAGACGCTTCCATAGA
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>TRINITY_DN555_c0_g1_i1
>K00363:119:HTJ23BBXX:1:2228:9658:46188_(reversed) 1:N:0:CGATGTAT
CGATGCTAGATTTAAAATAGACG
>K00363:119:HTJ23BBXX:1:2106:15260:10387_(reversed) 1:N:0:CGATGTAT
TTAAAATAGACGCTTCCATAGAGA

The result I want:

reference   reference_counts    Corresponding_K_sequences
>TRINITY_DN12824_c0_g1_i1   2   3
>TRINITY_DN555_c0_g1_i1 1   2

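Here is the code I have written. It only counts the >TRINITY sequences, and I could not extend it so that it also counts the corresponding >K sequences, so any help would be appreciated. To run: python code.py myfasta.fasta output.txt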

import sys
import os
from Bio import SeqIO
from collections import defaultdict
filename = sys.argv[1]
outfile = sys.argv[2]
dedup_records = defaultdict(list)

for record in SeqIO.parse(filename, "fasta"):
    if record.id.startswith('TRINITY'):
        # Use the sequence as the key and then have a list of id's as the value
        dedup_records[str(record.seq)].append(record.id)

with open(outfile, 'w') as output:
    # to get the counts of duplicated TRINITY ids (sorted order)
    for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
        #output.write("{}   {}\n".format(ids,len(ids)))
        print(ids, len(ids))

Answer 1

You have the right idea, but you need to keep track of the last header that starts with "TRINITY" and change your structure slightly:

from Bio import SeqIO
from collections import defaultdict

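# TRIN holds the most recent TRINITY header; d maps each TRINITY id to
# [number of times that header occurs, number of K headers that follow it]
TRIN, d = None, defaultdict(lambda: [0,0])

for r in SeqIO.parse('myfasta.fasta', 'fasta'):
    if r.id.startswith('TRINITY'):
        TRIN = r.id
        d[TRIN][0] += 1
    elif r.id.startswith('K'):
        # skip K records that appear before the first TRINITY header
        if TRIN:
            d[TRIN][1] += 1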

print('reference\treference_counts\tCorresponding_K_sequences')
for k,v in d.items():
    print('{}\t{}\t{}'.format(k,v[0],v[1]))

Output:

reference   reference_counts    Corresponding_K_sequences
TRINITY_DN12824_c0_g1_i1    2   3
TRINITY_DN555_c0_g1_i1  1   2
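
If Biopython is not available, the same "remember the last TRINITY header" idea can be sketched with plain Python, since only the header lines matter for the counts. This is a rough sketch rather than a drop-in replacement: the function names `count_groups` and `write_table` and the `ref_prefix` / `read_prefix` parameters are made up for illustration, and it assumes every header sits on a single line starting with >. It also puts the > back in front of each reference so the table looks like the one asked for in the question.

```python
def count_groups(lines, ref_prefix="TRINITY", read_prefix="K"):
    """Return {reference_id: [reference_count, read_count]} in first-seen order.

    `lines` can be any iterable of strings, e.g. an open file handle.
    (Hypothetical helper; the prefixes are assumptions taken from the question.)
    """
    counts = {}
    current = None  # most recent reference header seen so far
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line: irrelevant for counting headers
        fields = line[1:].split()
        if not fields:
            continue  # a bare ">" with no id
        header_id = fields[0]  # id is the text up to the first whitespace
        if header_id.startswith(ref_prefix):
            current = header_id
            counts.setdefault(current, [0, 0])[0] += 1
        elif header_id.startswith(read_prefix) and current is not None:
            counts[current][1] += 1
    return counts


def write_table(counts, handle):
    """Write the tab-separated table to any writable text handle."""
    handle.write("reference\treference_counts\tCorresponding_K_sequences\n")
    for ref, (n_ref, n_reads) in counts.items():
        handle.write(">{}\t{}\t{}\n".format(ref, n_ref, n_reads))
```

To get the output file from the question, something like `with open(sys.argv[1]) as fasta, open(sys.argv[2], "w") as out: write_table(count_groups(fasta), out)` should do, after importing `sys`.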

python

bioinformatics

fasta

biopython