Count specific pattern in seqID (Python)

I have a huge multi-FASTA sequence file, for example:

>Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042  (busco)
AAATGAGATA
>EOGATTT_0042_0042  (busco)
ATGGAAT
>EOGATTA_0042_0042  (busco)
ATAGGAGAT

I want to count how many Busco genes are in my file (their names all start with >EOG). For that I have a script:

from Bio import SeqIO

count = 0  # start at 0 so the total equals the number of records
for record in SeqIO.parse("concatenate_with_busco_names_0035_0042_aa.fa", "fasta"):
    count += 1
print(count)

set_of_labels = set()

with open("concatenate_with_busco_names_0035_0042_aa.fa") as f:
  for line in f:
    if line.startswith('>EOG'):
      label = line[4:].split('_')[0]
      set_of_labels.add(label)

print("Total number of Busco genes: " + str(len(set_of_labels)))

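But I would also like to know how many of these genes fall into each corresponding group. Let me explain better.

As you can see, each seqID contains two numbers in the form _number_number. These numbers are specific: the first _number corresponds to the species the sequence belongs to, and the second _number is another specific number. What I would like to know is whether it is possible to count, as I did above, how many different Busco genes I have for each first number, and how many for each of these seqID suffixes: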

_0035_0042
_0035_0035
_0042_0042
_0042_0035

For the example above it would be:

Total busco: 5 (I count only once if the >busco is present even if _number are different)
Total busco for the specie _0035 (_0035_0042 and _0035_0035) : 3
Total busco for the specie _0042 (_0042_0042 and _0042_0035) : 4
Total busco for the specific specie  _0035_0042 : 3
Total busco for the specific specie  _0042_0035 : 0
Total busco for the specific specie  _0042_0042 : 4
Total busco for the specific specie  _0035_0035 : 0

I hope that is clear. The first part (Total busco:) is already done by my script; I only need the other counts.

Here is the real data.

Besides busco, you can use several counters to get separate counts for the species and the specific species, for example:

import collections

busco = collections.defaultdict(int)  # busco counter
species = collections.defaultdict(int)  # species counter
specific_species = collections.defaultdict(int)  # specific species counter

with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco[entry[0]] += 1
            species[entry[1]] += 1
            specific_species[entry[1] + "_" + entry[2]] += 1

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie, total in specific_species.items():
    print("Total busco for the specific specie {}: {}".format(specie, total))

This should produce:

Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0042: 4

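(Specific) species that never occur will not be listed, but if you really want to print them, you can build the combinations from the species counter and print their values (which default to 0):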

import itertools

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, specific_species[s]))

which produces:

Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0035: 0
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0035: 0
Total busco for the specific specie 0042_0042: 4
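
A side note on the lookup used above, as a small sketch rather than part of the original script: indexing a collections.defaultdict(int) with a key it has never seen returns 0, but it also stores that key. If you would rather not grow the dictionary while printing, .get(key, 0) reads without inserting. The counts below are made up for illustration.

```python
import collections

# Hypothetical counts, just to illustrate the lookup behaviour.
specific_species = collections.defaultdict(int)
specific_species["0035_0042"] += 3

missing_via_get = specific_species.get("0042_0035", 0)  # does not insert the key
size_after_get = len(specific_species)

missing_via_index = specific_species["0042_0035"]  # inserts the key with value 0
size_after_index = len(specific_species)

print(missing_via_get, size_after_get)
print(missing_via_index, size_after_index)
```

For a one-off report either form is fine; the difference only matters if you iterate over the dictionary again afterwards.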

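UPDATE: If you want a unique count of buscos per species, you need to invert the counting so that it is indexed by species / specific species and collects the busco names in a set as the value. Then all you need is the length of each set, like: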

import collections
import itertools

busco = set()
species = collections.defaultdict(set)
specific_species = collections.defaultdict(set)

with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco.add(entry[0])
            species[entry[1]].add(entry[0])
            specific_species[entry[1] + "_" + entry[2]].add(entry[0])

print("Total busco: {}".format(len(busco)))
for specie, buscos in species.items():
    print("Total busco for the specie {}: {}".format(specie, len(buscos)))
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, len(specific_species[s])))

With your full data this prints:

Total busco: 421
Total busco for the specie 0035: 402
Total busco for the specie 0042: 397
Total busco for the specific specie 0035_0035: 392
Total busco for the specific specie 0035_0042: 262
Total busco for the specific specie 0042_0035: 305
Total busco for the specific specie 0042_0042: 383
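
To make the difference between the two approaches concrete, here is a minimal sketch on made-up header lines (the names EOGAAA and EOGBBB are hypothetical, not from the real data). It counts the same species once per header line and once per distinct busco name.

```python
import collections

# Hypothetical headers: EOGAAA appears twice for species 0035.
headers = [
    ">EOGAAA_0035_0042 (busco)",
    ">EOGAAA_0035_0035 (busco)",
    ">EOGBBB_0035_0042 (busco)",
]

occurrences = collections.defaultdict(int)  # counts every header line
unique = collections.defaultdict(set)       # keeps each busco name once

for line in headers:
    # Same slicing as above: drop the description, strip ">EOG", split on "_".
    name, species, _ = line.split()[0][4:].split("_")
    occurrences[species] += 1
    unique[species].add(name)

print(occurrences["0035"])   # number of header lines for species 0035
print(len(unique["0035"]))   # number of distinct busco names for species 0035
```

The integer counter sees three header lines, while the set keeps only two names, which is why the per-species totals in the full data are smaller than the number of matching headers.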

This is trivial with the Counter class from the Python standard library:

from collections import Counter
from io import StringIO

label_counter = Counter()
specy_counter = Counter()
specific_specy_counter = Counter()

# replace this with an open() on your real file 
finput = StringIO(""">Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042  (busco)
AAATGAGATA
>EOGATTT_0042_0042  (busco)
ATGGAAT
>EOGATTA_0042_0042  (busco)
ATAGGAGAT""")



for line in finput:
    try:
        if line.startswith('>EOG'):
            label, specy, specific = line[4:].replace(" (busco)", "").strip().split('_')
            label_counter[label] += 1
            specy_counter[specy] += 1
            specific_specy_counter[(specy, specific)] += 1
    except ValueError:
        print("Invalid line:", line)


print("Total busco:", len(label_counter))
for specy, count in specy_counter.items():
    print("Total busco for the specie {} : {}".format(specy, count))
for (specy, specific), count in specific_specy_counter.items():
    print("Total busco for the specific specy {}_{} : {}".format(specy, specific, count))

Note that species or specific numbers with a count of 0 are not displayed.
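
If the zero rows are wanted, one possible sketch is to enumerate every combination with itertools.product and look each one up, since a Counter returns 0 for a missing key without storing it. The counts below are hypothetical stand-ins for the counters built above, and the sketch assumes the second number takes the same values as the first.

```python
import itertools
from collections import Counter

# Hypothetical counts standing in for the counters built above.
specy_counter = Counter({"0035": 3, "0042": 4})
specific_specy_counter = Counter({("0035", "0042"): 3, ("0042", "0042"): 4})

rows = []
for pair in itertools.product(sorted(specy_counter), repeat=2):
    # Counter returns 0 for a missing key and does not store it.
    rows.append((pair, specific_specy_counter[pair]))

for (specy, specific), count in rows:
    print("Total busco for the specific specie {}_{} : {}".format(specy, specific, count))
```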