Biopython 从整个基因组中检索特定的 CDS
Biopython retrieving particular CDS from a whole genome
我是 Whosebug 的新手。我正在尝试使用 Biopython 自动执行搜索过程。我有两个列表,一个是蛋白质 GI 编号,另一个是相应的核苷酸 GI 编号。
例如:
protein_GI=[588489721,788136950,409084506]
nucleo_GI=[588489708,788136846,409084493]
第二个列表是使用 ELink 创建的。然而,核苷酸 GI 对应于整个基因组。我需要从每个与蛋白质 GI 匹配的基因组中检索特定的 CDS。
我尝试再次使用具有不同 link 名称("protein_nucleotide_cds"、"protein_nuccore")的 ELink,但我得到的只是整个基因组的 ID 号。我应该试试其他 link 个名字吗?
我还尝试了以下 EFetch 代码:
import Bio
from Bio import Entrez
Entrez.email = None
handle=Entrez.efetch(db="sequences",id="588489708,588489721",rettype="fasta",retmode="text")
print(handle.read())
这种方法在fasta文件中给出了核苷酸和蛋白质序列,但核苷酸序列是整个基因组。
如果有人能帮助我,我将不胜感激。
提前致谢!
希望对你有帮助
import Bio
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "mail@example.com"
gi_protein = "GI:588489721"
gi_genome = "GI:588489708"
handle=Entrez.efetch(db="sequences", id=gi_protein,rettype="fasta", retmode="text")
protein = SeqIO.parse(handle, "fasta").next()
handle=Entrez.efetch(db="sequences", id=gi_genome, rettype="gbwithparts", retmode="text")
genome = SeqIO.parse(handle, "gb").next()
#to extract feature with 'id' equal to protein
feature = [f for f in gb.features if "db_xref" in f.qualifiers and gi_protein in f.qualifiers["db_xref"]]
#to get location of CDS
start = feature[0].location.start.position
end = feature[0].location.end.position
strand = feature[0].location.strand
seq = genome[start: end]
if strand == 1:
print seq.seq
else:
#if strand is -1 then to get reverse complement
print seq.reverse_complement().seq
print protein.seq
然后你得到:
ATGGATTATATTGTTTCAGCACGAAAATATCGTCCCTCTACCTTTGTTTCGGTGGTAGGG
CAGCAGAACATCACCACTACATTAAAAAATGCCATTAAAGGCAGTCAACTGGCACACGCC
TATCTTTTTTGCGGACCGCGAGGTGTGGGAAAGACGACTTGTGCCCGTATCTTTGCTAAA
ACCATCAACTGTTCGAATATATCAGCTGATTTTGAAGCGTGTAATGAGTGTGAATCCTGT
AAGTCTTTTAATGAGAATCGTTCTTATAATATTCATGAACTGGATGGAGCCTCCAATAAC
TCAGTAGAGGATATCAGGAGTCTGATTGATAAAGTTCGTGTTCCACCTCAGATAGGTAGT
TATAGTGTATATATTATCGATGAGGTTCACATGTTATCGCAGGCAGCTTTTAATGCTTTT
CTTAAAACATTGGAAGAGCCACCCAAGCATGCCATCTTTATTTTGGCCACTACTGAAAAA
CATAAAATACTACCAACGATCCTGTCTCGTTGCCAGATTTACGATTTTAATAGGATTACC
ATTGAAGATGCGGTAGGTCATTTAAAATATGTAGCAGAGAGTGAGCATATAACTGTGGAA
GAAGAGGGGTTAACCGTCATTGCACAAAAAGCTGATGGAGCTATGCGGGATGCACTTTCC
ATCTTTGATCAGATTGTGGCTTTCTCAGGTAAAAGTATCAGCTATCAGCAAGTAATCGAT
AATTTGAATGTATTGGATTATGATTTTTACTTTAGGTTGGTGGATGCTTTTCTGGCAGAA
GATACTACACAAACACTATTGATTTTTGATGAGATATTGAAACGGGGATTTGATGCACAT
CATTTTATTTCCGGTTTAAGTTCTCATTTGCGTGATTTACTTGTATGTAAGGATGCAGCC
ACCATTCAGTTGCTGGATGTGGGTGCTAAAATTAAGGAGAAGTACGGTGTTCAGGCGCAA
AAAAGTACGATTGACTTTTTAATGGATGCTTTAAATATTACCAACGATTGCGATTTGCAA
TATAGGGTGGCTAAAAATAAGCGTTTGCATGTGGAGTTTGCTCTTCTTAAGATAGCACGT
GTATTAGATGAACAAAGAAAAAAGTAG
MDYIVSARKYRPSTFVSVVGQQNITTTLKNAIKGSQLAHAYLFCGPRGVGKTTCARIFAK
TINCSNISADFEACNECESCKSFNENRSYNIHELDGASNNSVEDIRSLIDKVRVPPQIGS
YSVYIIDEVHMLSQAAFNAFLKTLEEPPKHAIFILATTEKHKILPTILSRCQIYDFNRIT
IEDAVGHLKYVAESEHITVEEEGLTVIAQKADGAMRDALSIFDQIVAFSGKSISYQQVID
NLNVLDYDFYFRLVDAFLAEDTTQTLLIFDEILKRGFDAHHFISGLSSHLRDLLVCKDAA
TIQLLDVGAKIKEKYGVQAQKSTIDFLMDALNITNDCDLQYRVAKNKRLHVEFALLKIAR
VLDEQRKK
我是 Whosebug 的新手。我正在尝试使用 Biopython 自动执行搜索过程。我有两个列表,一个是蛋白质 GI 编号,另一个是相应的核苷酸 GI 编号。 例如:
protein_GI=[588489721,788136950,409084506]
nucleo_GI=[588489708,788136846,409084493]
第二个列表是使用 ELink 创建的。然而,核苷酸 GI 对应于整个基因组。我需要从每个与蛋白质 GI 匹配的基因组中检索特定的 CDS。 我尝试再次使用具有不同 link 名称("protein_nucleotide_cds"、"protein_nuccore")的 ELink,但我得到的只是整个基因组的 ID 号。我应该试试其他 link 个名字吗? 我还尝试了以下 EFetch 代码:
import Bio
from Bio import Entrez
Entrez.email = None
handle=Entrez.efetch(db="sequences",id="588489708,588489721",rettype="fasta",retmode="text")
print(handle.read())
这种方法在fasta文件中给出了核苷酸和蛋白质序列,但核苷酸序列是整个基因组。
如果有人能帮助我,我将不胜感激。 提前致谢!
希望对你有帮助
import Bio
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "mail@example.com"
gi_protein = "GI:588489721"
gi_genome = "GI:588489708"
handle=Entrez.efetch(db="sequences", id=gi_protein,rettype="fasta", retmode="text")
protein = SeqIO.parse(handle, "fasta").next()
handle=Entrez.efetch(db="sequences", id=gi_genome, rettype="gbwithparts", retmode="text")
genome = SeqIO.parse(handle, "gb").next()
#to extract feature with 'id' equal to protein
feature = [f for f in gb.features if "db_xref" in f.qualifiers and gi_protein in f.qualifiers["db_xref"]]
#to get location of CDS
start = feature[0].location.start.position
end = feature[0].location.end.position
strand = feature[0].location.strand
seq = genome[start: end]
if strand == 1:
print seq.seq
else:
#if strand is -1 then to get reverse complement
print seq.reverse_complement().seq
print protein.seq
然后你得到:
ATGGATTATATTGTTTCAGCACGAAAATATCGTCCCTCTACCTTTGTTTCGGTGGTAGGG CAGCAGAACATCACCACTACATTAAAAAATGCCATTAAAGGCAGTCAACTGGCACACGCC TATCTTTTTTGCGGACCGCGAGGTGTGGGAAAGACGACTTGTGCCCGTATCTTTGCTAAA ACCATCAACTGTTCGAATATATCAGCTGATTTTGAAGCGTGTAATGAGTGTGAATCCTGT AAGTCTTTTAATGAGAATCGTTCTTATAATATTCATGAACTGGATGGAGCCTCCAATAAC TCAGTAGAGGATATCAGGAGTCTGATTGATAAAGTTCGTGTTCCACCTCAGATAGGTAGT TATAGTGTATATATTATCGATGAGGTTCACATGTTATCGCAGGCAGCTTTTAATGCTTTT CTTAAAACATTGGAAGAGCCACCCAAGCATGCCATCTTTATTTTGGCCACTACTGAAAAA CATAAAATACTACCAACGATCCTGTCTCGTTGCCAGATTTACGATTTTAATAGGATTACC ATTGAAGATGCGGTAGGTCATTTAAAATATGTAGCAGAGAGTGAGCATATAACTGTGGAA GAAGAGGGGTTAACCGTCATTGCACAAAAAGCTGATGGAGCTATGCGGGATGCACTTTCC ATCTTTGATCAGATTGTGGCTTTCTCAGGTAAAAGTATCAGCTATCAGCAAGTAATCGAT AATTTGAATGTATTGGATTATGATTTTTACTTTAGGTTGGTGGATGCTTTTCTGGCAGAA GATACTACACAAACACTATTGATTTTTGATGAGATATTGAAACGGGGATTTGATGCACAT CATTTTATTTCCGGTTTAAGTTCTCATTTGCGTGATTTACTTGTATGTAAGGATGCAGCC ACCATTCAGTTGCTGGATGTGGGTGCTAAAATTAAGGAGAAGTACGGTGTTCAGGCGCAA AAAAGTACGATTGACTTTTTAATGGATGCTTTAAATATTACCAACGATTGCGATTTGCAA TATAGGGTGGCTAAAAATAAGCGTTTGCATGTGGAGTTTGCTCTTCTTAAGATAGCACGT GTATTAGATGAACAAAGAAAAAAGTAG MDYIVSARKYRPSTFVSVVGQQNITTTLKNAIKGSQLAHAYLFCGPRGVGKTTCARIFAK TINCSNISADFEACNECESCKSFNENRSYNIHELDGASNNSVEDIRSLIDKVRVPPQIGS YSVYIIDEVHMLSQAAFNAFLKTLEEPPKHAIFILATTEKHKILPTILSRCQIYDFNRIT IEDAVGHLKYVAESEHITVEEEGLTVIAQKADGAMRDALSIFDQIVAFSGKSISYQQVID NLNVLDYDFYFRLVDAFLAEDTTQTLLIFDEILKRGFDAHHFISGLSSHLRDLLVCKDAA TIQLLDVGAKIKEKYGVQAQKSTIDFLMDALNITNDCDLQYRVAKNKRLHVEFALLKIAR VLDEQRKK