Count specific pattern in seqID (Python)
I have a huge multi-FASTA sequence file, for example:
>Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042 (busco)
AAATGAGATA
>EOGATTT_0042_0042 (busco)
ATGGAAT
>EOGATTA_0042_0042 (busco)
ATAGGAGAT
I want to count how many BUSCO genes there are in my file (their names all start with >EOG). For that I have a script:
from Bio import SeqIO

# Count all records in the file (start at 0 so the total equals the number of records).
count = 0
for record in SeqIO.parse("concatenate_with_busco_names_0035_0042_aa.fa", "fasta"):
    count += 1
print(count)

# Collect the distinct BUSCO labels (headers starting with >EOG).
set_of_labels = set()
with open("concatenate_with_busco_names_0035_0042_aa.fa") as f:
    for line in f:
        if line.startswith('>EOG'):
            label = line[4:].split('_')[0]
            set_of_labels.add(label)
print("Total number of Busco genes: " + str(len(set_of_labels)))
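But I would also like to know how many genes fall into each corresponding group. Let me explain:
as you can see, each seqID contains two numbers, in the form _number_number.
These numbers are specific: the first _number corresponds to the species the sequence belongs to, and the second _number is a specific number.
So I would like to know whether it is possible to count, as I did above, how many different BUSCO genes I have for the sequences whose first number is _0035 and for those whose first number is _0042,
and also how many there are for each of these seq ID suffixes:
_0035_0042
_0035_0035
_0042_0042
_0042_0035
For the example above it would be: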
Total busco: 5 (I count only once if the >busco is present even if _number are different)
Total busco for the specie _0035 (_0035_0042 and _0035_0035) : 3
Total busco for the specie _0042 (_0042_0042 and _0042_0035) : 4
Total busco for the specific specie _0035_0042 : 3
Total busco for the specific specie _0042_0035 : 0
Total busco for the specific specie _0042_0042 : 4
Total busco for the specific specie _0035_0035 : 0
I hope that is clear. The first part (Total busco:) is already done by my script; I only need the remaining counts.
Here is the real data.
In addition to the busco counter, you can use more counters to get separate counts for species and specific species, for example:
import collections

busco = collections.defaultdict(int)  # busco counter
species = collections.defaultdict(int)  # species counter
specific_species = collections.defaultdict(int)  # specific species counter

with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco[entry[0]] += 1
            species[entry[1]] += 1
            specific_species[entry[1] + "_" + entry[2]] += 1

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie, total in specific_species.items():
    print("Total busco for the specific specie {}: {}".format(specie, total))
Which should yield:
Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0042: 4
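To make the parsing expression in the loop above easier to follow, here is a minimal sketch that applies it to a single header. The helper name parse_busco_header is hypothetical, and the use of rsplit instead of split is a small variation (an assumption that the two numbers are always the last two underscore-separated fields), not what the answer itself does.

```python
def parse_busco_header(line):
    """Split a header such as '>EOGWX56TR_0035_0042 (busco)' into its parts.

    Returns (busco_id, species, specific), or None for non-BUSCO lines.
    parse_busco_header is a hypothetical helper name used for illustration.
    """
    if not line.startswith(">EOG"):
        return None
    # Keep only the first whitespace-separated token (drops "(busco)"),
    # then strip the leading ">EOG" prefix.
    seq_id = line.split()[0][4:]
    # Split from the right so an underscore inside the BUSCO id
    # cannot shift the species / specific fields.
    busco_id, species, specific = seq_id.rsplit("_", 2)
    return busco_id, species, specific


print(parse_busco_header(">EOGWX56TR_0035_0042 (busco)"))  # ('WX56TR', '0035', '0042')
print(parse_busco_header(">Seq_1_0035_0035"))              # None
```

A header that starts with >EOG but has fewer than three underscore-separated fields raises ValueError here, so malformed lines are not silently counted.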
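(Specific) species that do not occur in the file will not be listed, but if you really want to print them, you can build every combination from the keys of the species counter and print their values (which default to 0):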
import itertools

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, specific_species[s]))
Which yields:
Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0035: 0
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0035: 0
Total busco for the specific specie 0042_0042: 4
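One side effect of the loop above is worth knowing: indexing a defaultdict with a missing key stores that key, so after printing, specific_species also contains the zero entries. If that matters, reading with .get() avoids it. A small sketch, with the counts hard-coded from the example rather than read from a file:

```python
import collections
import itertools

# Counts copied by hand from the small example above (not read from a file).
species = {"0035": 3, "0042": 4}
specific_species = collections.defaultdict(int, {"0035_0042": 3, "0042_0042": 4})

for pair in itertools.product(species, species):
    key = "_".join(pair)
    # .get() reads the value without inserting missing keys into the defaultdict.
    total = specific_species.get(key, 0)
    print("Total busco for the specific specie {}: {}".format(key, total))

print(len(specific_species))  # still 2: no zero entries were added
```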
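UPDATE: If you want each busco to be counted only once, you need to invert the counting so that it is indexed by specie / specific specie and collects the busco names in a set as the values. Then all you need is the length of each set, like: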
import collections
import itertools

busco = set()
species = collections.defaultdict(set)
specific_species = collections.defaultdict(set)

with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco.add(entry[0])
            species[entry[1]].add(entry[0])
            specific_species[entry[1] + "_" + entry[2]].add(entry[0])

print("Total busco: {}".format(len(busco)))
for specie, buscos in species.items():
    print("Total busco for the specie {}: {}".format(specie, len(buscos)))
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, len(specific_species[s])))
Which, for your full data, prints:
Total busco: 421
Total busco for the specie 0035: 402
Total busco for the specie 0042: 397
Total busco for the specific specie 0035_0035: 392
Total busco for the specific specie 0035_0042: 262
Total busco for the specific specie 0042_0035: 305
Total busco for the specific specie 0042_0042: 383
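The difference between counting occurrences and counting unique buscos only shows up when the same header appears more than once. The sketch below wraps the set-based approach in a function (count_unique_buscos is a hypothetical name) and runs it on a few made-up headers, one of them repeated on purpose, so it can be tried without the real file.

```python
import collections


def count_unique_buscos(lines):
    """Return (all_buscos, per_species, per_specific) as sets of BUSCO ids."""
    all_buscos = set()
    per_species = collections.defaultdict(set)
    per_specific = collections.defaultdict(set)
    for line in lines:
        if line.startswith(">EOG"):
            busco_id, species, specific = line.split()[0][4:].split("_")
            all_buscos.add(busco_id)
            per_species[species].add(busco_id)
            per_specific[species + "_" + specific].add(busco_id)
    return all_buscos, per_species, per_specific


# Made-up headers for illustration; the third one repeats the second.
headers = [
    ">EOGWX56TR_0035_0042 (busco)",
    ">EOGWX56TR_0042_0042 (busco)",
    ">EOGWX56TR_0042_0042 (busco)",
    ">EOGATTA_0042_0042 (busco)",
]
all_buscos, per_species, per_specific = count_unique_buscos(headers)
print(len(all_buscos))                 # 2
print(len(per_species["0042"]))        # 2
print(len(per_specific["0042_0042"]))  # 2
```

With plain integer counters the species 0042 would be reported as 3 here, because the repeated header is counted twice; the sets report 2.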
This is trivial with the Counter class from the Python standard library:
from collections import Counter
from io import StringIO
label_counter = Counter()
specy_counter = Counter()
specific_specy_counter = Counter()
# replace this with an open() on your real file
finput = StringIO(""">Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042 (busco)
AAATGAGATA
>EOGATTT_0042_0042 (busco)
ATGGAAT
>EOGATTA_0042_0042 (busco)
ATAGGAGAT""")
for line in finput:
    try:
        if line.startswith('>EOG'):
            label, specy, specific = line[4:].replace(" (busco)", "").strip().split('_')
            label_counter[label] += 1
            specy_counter[specy] += 1
            specific_specy_counter[(specy, specific)] += 1
    except ValueError:
        print("Invalid line:", line)
print("Total busco:", len(label_counter))
for specy, count in specy_counter.items():
print("Total busco for the specie {} : {}".format(specy, count))
for (specy, specific), count in specific_specy_counter.items():
print("Total busco for the specific specy {}_{} : {}".format(specy, specific, count))
Note that species or specifics with a count of 0 will not be shown.
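If the zero rows are wanted as well, one possible sketch is to iterate over every (species, specific) pair and look each one up; a Counter returns 0 for a missing key and does not store it. The counts below are hard-coded from the example, and the assumption that the second number takes the same values as the species numbers is mine, not something the data guarantees.

```python
from collections import Counter
from itertools import product

# Counts as produced for the example above; keys are (species, specific) tuples.
specy_counter = Counter({"0035": 3, "0042": 4})
specific_specy_counter = Counter({("0035", "0042"): 3, ("0042", "0042"): 4})

# Assumption: the second number takes the same values as the species numbers.
for specy, specific in product(sorted(specy_counter), repeat=2):
    # Looking up a missing key in a Counter yields 0 without inserting it.
    count = specific_specy_counter[(specy, specific)]
    print("Total busco for the specific specy {}_{} : {}".format(specy, specific, count))
```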