Only some of my species are being converted to NCBI IDs when using Biopython to convert species names to IDs
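I have some code that takes species names containing underscores from a list, converts them to a format suitable for NCBI, and then searches for the ID associated with each species name. For some reason, this does not work for every entry in my input file. I have attached my code, a subset of the input file, and a subset of the output file.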
from Bio import Entrez
import time

Entrez.email = 'fake.email@isp.com'


def get_tax_id(species):
    """Return the list of NCBI Taxonomy IDs found for a species name."""
    # Turn underscores into spaces; Biopython URL-encodes the term itself.
    species = species.replace('_', ' ').strip()
    search = Entrez.esearch(term=species, db='taxonomy', retmode='xml')
    record = Entrez.read(search)
    search.close()
    return record['IdList']


if __name__ == '__main__':
    # Avoid ':' and '#' in the file name; ':' is not allowed on Windows.
    current_time = time.strftime("%d.%m.%y_%H-%M", time.localtime())
    output_name = 'test_%s.txt' % current_time
    # The species name is the first tab-separated column of each line.
    with open("OGTlist.csv") as infile:
        organisms = [line.split('\t')[0].strip() for line in infile if line.strip()]
    with open(output_name, "w") as outfile:
        for organism in organisms:
            taxids = get_tax_id(organism)
            if not taxids:
                # An empty IdList means NCBI Taxonomy returned no hit.
                outfile.write(organism + ',ERROR_no_ID_match\n')
            else:
                outfile.write(organism + ',' + ';'.join(taxids) + '\n')
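So this code writes a results file: if the conversion succeeds, it writes the species name and the ID; if it does not, it just writes an error. My results file looks like this: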
micromonospora_inyonensis,47866
viola_arvensis,97415
amycolatopsis_albidoflavus,102226
tetragenococcus_koreensis,290335
panaeolus_papilionaceus,330517
geomys_pinetis,100306
vibrio_lutjanus,ERROR_no_ID_match
succiniclasticum_ruminis,40841
microtetraspora_malaysiensis,161358
blarina_carolinensis,183658
amycolatopsis_palatopharyngis,187982
rhodosporidium_toruloides,5286
geobacter_bemidjiensis,225194
acinetobacter_haemolyticus,29430
actinoplanes_tereljensis,571912
phyllostomus_hastatus,9423
phacidium_infestans,66518
dorea_formicigenerans,39486
hoeflea_marina,274592
naemacyclus_minor,64355
methanosaeta_thermophila,2224
pholiota_carbonaria,227966
sphingomonas_faeni,185950
helicobacter_pullorum,35818
solitalea_koreensis,543615
dermacoccus_profundi,322602
pseudomonas_pictorum,86184
actinomadura_livida,79909
leptonycteris_curasoae,55054
psychrobacter_salsus,219741
vibrio_inusitatus,413402
stereum_rameale,ERROR_no_ID_match
photorhabdus_temperata,574560
clitocybe_lignatilis,5634
actinocorallia_glomerata,46203
aspergillus_giganteus,5060
erwinia_amylovora,552
hydrogenoanaerobacterium_saccharovorans,474960
mycobacterium_aichiense,1799
nocardia_pneumoniae,228601
bacillus_pocheonensis,363869
streptomonospora_alba,183763
exobasidium_gracile,190086
phenylobacterium_zucineum,284016
amsonia_tabernaemontana,144544
rattus_fuscipes,10119
jannaschia_rubra,282197
hereroa_rehneltiana,ERROR_no_ID_match
The file I take the species names from looks like this:
micromonospora_inyonensis 28 DSMZ
viola_arvensis 23 DSMZ
amycolatopsis_albidoflavus 28 DSMZ
tetragenococcus_koreensis 28 DSMZ
panaeolus_papilionaceus 24 DSMZ
geomys_pinetis 36.3 white
vibrio_lutjanus 30 DSMZ
succiniclasticum_ruminis 37 DSMZ
microtetraspora_malaysiensis 28 DSMZ
blarina_carolinensis 36.8 white
amycolatopsis_palatopharyngis 28 DSMZ
rhodosporidium_toruloides 23 DSMZ
geobacter_bemidjiensis 30 DSMZ
acinetobacter_haemolyticus 28 DSMZ
actinoplanes_tereljensis 28 DSMZ
phyllostomus_hastatus 34.7 white
phacidium_infestans 25 DSMZ
dorea_formicigenerans 37 DSMZ
hoeflea_marina 28 DSMZ
naemacyclus_minor 22 DSMZ
methanosaeta_thermophila 58.3333333333 DSMZ
pholiota_carbonaria 25 DSMZ
sphingomonas_faeni 22 DSMZ
helicobacter_pullorum 37 DSMZ
solitalea_koreensis 28 DSMZ
dermacoccus_profundi 28 DSMZ
pseudomonas_pictorum 28 DSMZ
actinomadura_livida 28 DSMZ
leptonycteris_curasoae 35.7 white
psychrobacter_salsus 22 DSMZ
vibrio_inusitatus 28 DSMZ
stereum_rameale 20 DSMZ
photorhabdus_temperata 28.6666666667 DSMZ
clitocybe_lignatilis 25 DSMZ
actinocorallia_glomerata 28 DSMZ
aspergillus_giganteus 24.5 DSMZ
erwinia_amylovora 26.6666666667 DSMZ
hydrogenoanaerobacterium_saccharovorans 37 DSMZ
mycobacterium_aichiense 37 DSMZ
nocardia_pneumoniae 28 DSMZ
bacillus_pocheonensis 30 DSMZ
streptomonospora_alba 28 DSMZ
exobasidium_gracile 20 DSMZ
phenylobacterium_zucineum 30 DSMZ
amsonia_tabernaemontana 23 DSMZ
rattus_fuscipes 37.5 white
jannaschia_rubra 25 DSMZ
hereroa_rehneltiana 23 DSMZ
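My actual input file has about 2000 entries. Is the answer as simple as the species names being incorrect, or IDs not existing on NCBI for every species? Does anyone have a way to solve this programmatically?
First answer: the species names do not exist. You can check this on the NCBI website, like here: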
https://www.ncbi.nlm.nih.gov/search/?term=Stereum+rameale
https://www.ncbi.nlm.nih.gov/search/?term=vibrio_lutjanus
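If you look at other sites, Vibrio lutjanus does not seem to exist anywhere, for example https://www.arb-silva.de/search/ or
There is no way to fix this (that is, to find a taxon ID for a name NCBI does not have), but you can double-check whether the names are correct. Taxonomy is hard: people use different names for the same organism, and there are many synonyms. You can use taxonomic name services that offer an API, such as GBIF or Global Names.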
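As a rough sketch of what checking names against such a service could look like, the snippet below builds a query for GBIF's species match endpoint and summarises the reply. The endpoint URL and the response fields used here (`matchType`, `canonicalName`, `scientificName`, `status`) are written from memory of the GBIF API and should be treated as assumptions to verify against the GBIF documentation; the function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed GBIF name-matching endpoint; verify against the GBIF API docs.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name):
    """Build a GBIF match URL for a name such as 'stereum_rameale'."""
    cleaned = name.replace("_", " ").strip()
    return GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": cleaned})


def interpret_match(result):
    """Summarise a decoded GBIF match response (a dict) in one line."""
    match_type = result.get("matchType", "NONE")
    if match_type == "NONE":
        return "no match"
    label = result.get("canonicalName") or result.get("scientificName", "?")
    status = result.get("status", "UNKNOWN")
    return "%s match: %s (%s)" % (match_type.lower(), label, status.lower())


def check_name(name):
    """Query GBIF over the network and summarise the answer."""
    with urllib.request.urlopen(build_match_url(name), timeout=30) as response:
        return interpret_match(json.load(response))
```

Calling `check_name(organism)` for each entry that ended up as `ERROR_no_ID_match` would show whether the name is unknown there as well, or whether it is reported as a fuzzy match or synonym that can be retried against NCBI under the suggested spelling.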
[Edit]
If the species is not available, you can also look up the taxon ID of the genus. NCBI's taxonomy information can be downloaded here:
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/
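You need to download the zip file, and will probably need the files rankedlineage.dmp and merged.dmp. The Global Names website can also be used at genus level. I don't know whether Entrez from BioPython can look up IDs at genus level; maybe that is an option too.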
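A minimal offline sketch of the genus fallback, working from the downloaded dump instead of Entrez. It assumes the usual taxdump layout: fields separated by `\t|\t`, each line ending in `\t|`, with rankedlineage.dmp starting with tax_id and tax_name, and merged.dmp holding old_tax_id and new_tax_id. The function names are invented for this example, and a name that occurs more than once in the dump simply keeps the last ID seen, so ambiguous names would need extra care.

```python
def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line terminator
    return [field.strip() for field in line.split("\t|\t")]


def load_names(rankedlineage_lines):
    """Map lower-cased tax_name -> tax_id (assumed first two columns)."""
    names = {}
    for line in rankedlineage_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            names[fields[1].lower()] = fields[0]
    return names


def load_merged(merged_lines):
    """Map old tax_id -> new tax_id from merged.dmp lines."""
    merged = {}
    for line in merged_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            merged[fields[0]] = fields[1]
    return merged


def current_id(tax_id, merged):
    """Return the ID a tax_id was merged into, or the ID itself."""
    return merged.get(tax_id, tax_id)


def lookup(name, names):
    """Return (tax_id, level): try the full name first, then the genus."""
    cleaned = name.replace("_", " ").strip().lower()
    if cleaned in names:
        return names[cleaned], "species"
    genus = cleaned.split(" ")[0] if cleaned else ""
    if genus in names:
        return names[genus], "genus"
    return None, "none"
```

With the archive unpacked, something like `with open("rankedlineage.dmp") as fh: names = load_names(fh)` followed by `lookup("vibrio_lutjanus", names)` would return a genus-level ID when the species itself is missing, and `current_id` can be used to bring older IDs up to date.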