Biopython:如何下载 NCBI 中的所有肽序列(或与特定术语相关的所有记录)?
Biopython: How to download all of the peptide sequences (or all records associated to a particular term) in NCBI?
我想以 fasta 格式下载 NCBI 蛋白质数据库中的所有肽序列(即 > 和肽名称,后跟肽序列),我看到有一个 MESH 术语描述什么是肽 here,但我不知道如何合并它。
我写了这个:
import Bio
from Bio import Entrez
Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))
但它只打印出 995 个 ID,没有要归档的序列,我想知道是否有人可以证明我哪里出错了。
请注意 a search for the term peptide in the NCBI protein database returns 8187908 hits,因此请确保您确实想要将所有这些命中的肽序列下载到一个大的 fasta 文件中。
>>> from Bio import Entrez
>>> Entrez.email = 'test@gmail.com'
>>> handle = Entrez.esearch(db="protein", term="peptide")
>>> record = Entrez.read(handle)
>>> record["Count"]
'8187908'
Entrez.esearch returns 的默认记录数为 20。这是为了防止 NCBI 服务器过载。
>>> len(record["IdList"])
20
要获取完整的记录列表,请更改 retmax
参数:
>>> count = record["Count"]
>>> handle = Entrez.esearch(db="protein", term="peptide", retmax=count)
>>> record = Entrez.read(handle)
>>> len(record['IdList'])
8187908
下载所有记录的方法是使用Entrez.epost
来自 chapter 9.4 of the BioPython tutorial:
EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.
To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations – anything). When you make a request with EFetch your list of IDs, the database etc, are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don’t cope well).
Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an “HTML post” internally, rather than an “HTML get”, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.
[...] The returned XML includes two important strings, QueryKey and WebEnv which together define your history session. You would extract these values for use with another Entrez call such as EFetch.
阅读[第 9.15 章:使用历史搜索和下载序列][3] 了解如何使用 QueryKey
和 WebEnv
一个完整的工作示例将是:
from Bio import Entrez
import time
from urllib.error import HTTPError
DB = "protein"
QUERY = "peptide"
Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db=DB, term=QUERY, rettype='fasta')
record = Entrez.read(handle)
count = record['Count']
handle = Entrez.esearch(db=DB, term=QUERY, retmax=count, rettype='fasta')
record = Entrez.read(handle)
id_list = record['IdList']
post_xml = Entrez.epost(DB, id=",".join(id_list))
search_results = Entrez.read(post_xml)
webenv = search_results['WebEnv']
query_key = search_results['QueryKey']
batch_size = 200
with open('peptides.fasta', 'w') as out_handle:
for start in range(0, count, batch_size):
end = min(count, start+batch_size)
print(f"Going to download record {start+1} to {end}")
attempt = 0
success = False
while attempt < 3 and not success:
attempt += 1
try:
fetch_handle = Entrez.efetch(db=DB, rettype='fasta',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
success = True
except HTTPError as err:
if 500 <= err.code <= 599:
print(f"Received error from server {err}")
print("Attempt {attempt} of 3")
time.sleep(15)
else:
raise
data = fetch_handle.read()
fetch_handle.close()
out_handle.write(data)
peptides.fasta
的前几行是这样的:
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
MFKSFFPKPGPFFISAFIWSMLAVIFWQAGGGDWLLRVTGASQNVAISAARFWSLNYLVFYAYYLFCVGV
FALFWFVYCPHRWQYWSILGTSLIIFVTWFLVEVGVAINAWYAPFYDLIQSALATPHKVSINQFYQEIGV
FLGIAIIAVIIGVMNNFFVSHYVFRWRTAMNEHYMAHWQHLRHIEGAAQRVQEDTMRFASTLEDMGVSFI
NAVMTLIAFLPVLVTLSEHVPDLPIVGHLPYGLVIAAIVWSLMGTGLLAVVGIKLPGLEFKNQRVEAAYR
KELVYGEDDETRATPPTVRELFRAVRRNYFRLYFHYMYFNIARILYLQVDNVFGLFLLFPSIVAGTITLG
LMTQITNVFGQVRGSFQYLISSWTTLVELMSIYKRLRSFERELDGKPLQEAIPTLR
我想以 fasta 格式下载 NCBI 蛋白质数据库中的所有肽序列(即 > 和肽名称,后跟肽序列),我看到有一个 MESH 术语描述什么是肽 here,但我不知道如何合并它。
我写了这个:
import Bio
from Bio import Entrez
Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))
但它只打印出 995 个 ID,没有要归档的序列,我想知道是否有人可以证明我哪里出错了。
请注意 a search for the term peptide in the NCBI protein database returns 8187908 hits,因此请确保您确实想要将所有这些命中的肽序列下载到一个大的 fasta 文件中。
>>> from Bio import Entrez
>>> Entrez.email = 'test@gmail.com'
>>> handle = Entrez.esearch(db="protein", term="peptide")
>>> record = Entrez.read(handle)
>>> record["Count"]
'8187908'
Entrez.esearch returns 的默认记录数为 20。这是为了防止 NCBI 服务器过载。
>>> len(record["IdList"])
20
要获取完整的记录列表,请更改 retmax
参数:
>>> count = record["Count"]
>>> handle = Entrez.esearch(db="protein", term="peptide", retmax=count)
>>> record = Entrez.read(handle)
>>> len(record['IdList'])
8187908
下载所有记录的方法是使用Entrez.epost
来自 chapter 9.4 of the BioPython tutorial:
EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.
To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations – anything). When you make a request with EFetch your list of IDs, the database etc, are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don’t cope well).
Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an “HTML post” internally, rather than an “HTML get”, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.
[...] The returned XML includes two important strings, QueryKey and WebEnv which together define your history session. You would extract these values for use with another Entrez call such as EFetch.
阅读[第 9.15 章:使用历史搜索和下载序列][3] 了解如何使用 QueryKey
和 WebEnv
一个完整的工作示例将是:
from Bio import Entrez
import time
from urllib.error import HTTPError
DB = "protein"
QUERY = "peptide"
Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db=DB, term=QUERY, rettype='fasta')
record = Entrez.read(handle)
count = record['Count']
handle = Entrez.esearch(db=DB, term=QUERY, retmax=count, rettype='fasta')
record = Entrez.read(handle)
id_list = record['IdList']
post_xml = Entrez.epost(DB, id=",".join(id_list))
search_results = Entrez.read(post_xml)
webenv = search_results['WebEnv']
query_key = search_results['QueryKey']
batch_size = 200
with open('peptides.fasta', 'w') as out_handle:
for start in range(0, count, batch_size):
end = min(count, start+batch_size)
print(f"Going to download record {start+1} to {end}")
attempt = 0
success = False
while attempt < 3 and not success:
attempt += 1
try:
fetch_handle = Entrez.efetch(db=DB, rettype='fasta',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
success = True
except HTTPError as err:
if 500 <= err.code <= 599:
print(f"Received error from server {err}")
print("Attempt {attempt} of 3")
time.sleep(15)
else:
raise
data = fetch_handle.read()
fetch_handle.close()
out_handle.write(data)
peptides.fasta
的前几行是这样的:
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
MFKSFFPKPGPFFISAFIWSMLAVIFWQAGGGDWLLRVTGASQNVAISAARFWSLNYLVFYAYYLFCVGV
FALFWFVYCPHRWQYWSILGTSLIIFVTWFLVEVGVAINAWYAPFYDLIQSALATPHKVSINQFYQEIGV
FLGIAIIAVIIGVMNNFFVSHYVFRWRTAMNEHYMAHWQHLRHIEGAAQRVQEDTMRFASTLEDMGVSFI
NAVMTLIAFLPVLVTLSEHVPDLPIVGHLPYGLVIAAIVWSLMGTGLLAVVGIKLPGLEFKNQRVEAAYR
KELVYGEDDETRATPPTVRELFRAVRRNYFRLYFHYMYFNIARILYLQVDNVFGLFLLFPSIVAGTITLG
LMTQITNVFGQVRGSFQYLISSWTTLVELMSIYKRLRSFERELDGKPLQEAIPTLR