Biopython's ESearch does not give me the full IdList

I am trying to search for some articles using the following code:

handle = Entrez.esearch(db="pubmed", term="lung+cancer")
record = Entrez.read(handle)

In record['Count'] I can see that there are 293279 results, but when I look at record['IdList'] it only gives me 20 IDs. Why is that? How can I get all 293279 records?

By default, Entrez.esearch returns only 20 records. This is to avoid overloading NCBI's servers. To get the full list of records, change the retmax parameter:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="lung+cancer")
>>> record = Entrez.read(handle)
>>> count = record['Count']
>>> handle = Entrez.esearch(db="pubmed", term="lung+cancer", retmax=count)
>>> record = Entrez.read(handle)
>>> print(len(record['IdList']))
293279
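
If you would rather not ask for every ID in one request, the same search can be paged with the retstart and retmax parameters of ESearch. The helper below is only a sketch: page_windows is a made-up name, not part of Biopython, and the numbers in the example are arbitrary. NCBI also documents an upper limit on how many IDs a single ESearch call returns, so check the E-utilities help for the current cap on your database.

```python
def page_windows(count, page_size):
    """Yield (retstart, retmax) pairs that together cover `count` results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, count, page_size):
        # The last window may be shorter than page_size.
        yield start, min(page_size, count - start)


windows = list(page_windows(45, 20))
print(windows)  # [(0, 20), (20, 20), (40, 5)]
```

Each pair would then be passed on as Entrez.esearch(db="pubmed", term="lung+cancer", retstart=start, retmax=size), and the IdList of each page appended to one list.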

One way to download all of the records is to use Entrez.epost.

From chapter 9.4 of the Biopython tutorial:

EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.

To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations – anything). When you make a request with EFetch your list of IDs, the database etc, are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don’t cope well).

Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an “HTML post” internally, rather than an “HTML get”, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.

[...] The returned XML includes two important strings, QueryKey and WebEnv which together define your history session. You would extract these values for use with another Entrez call such as EFetch.

Read chapter 9.15, "Searching for and downloading sequences using the history", to learn how to use QueryKey and WebEnv.
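
As a rough sketch of what that chapter describes: ESearch can store its own results on the history server when called with usehistory="y", and the parsed record then carries WebEnv and QueryKey directly. The function name search_with_history and the idea of passing the Bio.Entrez module in as an argument are my own assumptions, made so that the sketch can be tried without network access.

```python
def search_with_history(entrez, term, db="pubmed"):
    """Run ESearch with usehistory="y" and return (count, webenv, query_key).

    `entrez` is expected to be the Bio.Entrez module; it is passed in as an
    argument only so this sketch can be exercised without network access.
    """
    handle = entrez.esearch(db=db, term=term, usehistory="y")
    try:
        record = entrez.read(handle)
    finally:
        handle.close()
    return int(record["Count"]), record["WebEnv"], record["QueryKey"]


# Typical use (needs Biopython and network access):
# from Bio import Entrez
# Entrez.email = "A.N.Other@example.com"
# count, webenv, query_key = search_with_history(Entrez, "lung cancer")
```

The returned webenv and query_key can be handed to Entrez.efetch in the same way as the values that come back from EPost.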

A complete working example would be:

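from Bio import Entrez
import time

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are

# First search: only needed to learn how many records match.
handle = Entrez.esearch(db="pubmed", term="lung+cancer")
record = Entrez.read(handle)
handle.close()

# Second search: ask for every matching ID by setting retmax to the count.
count = int(record['Count'])
handle = Entrez.esearch(db="pubmed", term="lung+cancer", retmax=count)
record = Entrez.read(handle)
handle.close()

# Upload the IDs with EPost so that EFetch can refer to them via the history.
id_list = record['IdList']
post_xml = Entrez.epost("pubmed", id=",".join(id_list))
search_results = Entrez.read(post_xml)
post_xml.close()

# WebEnv and QueryKey together identify the uploaded ID list.
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]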

batch_size = 200
with open("lung_cancer.txt", "w") as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        print("Going to download record %i to %i" % (start+1, end))
        attempt = 0
        success = False
        while attempt < 3 and not success:
            attempt += 1
            try:
                # MEDLINE plain text, so batches can be appended to one text file.
                fetch_handle = Entrez.efetch(db="pubmed",
                                             rettype="medline", retmode="text",
                                             retstart=start, retmax=batch_size,
                                             webenv=webenv, query_key=query_key)
                success = True
            except HTTPError as err:
                # Retry server-side (5xx) errors; give up after the third attempt.
                if 500 <= err.code <= 599 and attempt < 3:
                    print("Received error from server %s" % err)
                    print("Attempt %i of 3" % attempt)
                    time.sleep(15)
                else:
                    raise
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
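
The retry logic in the download loop can also be pulled out into a small helper, so that a batch which keeps failing surfaces as an error. This is a sketch under my own assumptions: call_with_retries and its parameters are hypothetical names, not part of Biopython, and the 3 attempts and 15 second delay simply mirror the example above.

```python
import time


def call_with_retries(func, is_retryable, attempts=3, delay=15, sleep=time.sleep):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries.

    Exceptions for which is_retryable(exc) is false, and the exception raised
    by the final attempt, are re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as err:
            if attempt == attempts or not is_retryable(err):
                raise
            print("Attempt %i of %i failed: %s" % (attempt, attempts, err))
            sleep(delay)


# Possible use inside the download loop (needs Biopython and network access):
# fetch_handle = call_with_retries(
#     lambda: Entrez.efetch(db="pubmed", retstart=start, retmax=batch_size,
#                           webenv=webenv, query_key=query_key),
#     is_retryable=lambda err: isinstance(err, HTTPError) and 500 <= err.code <= 599,
# )
```

Passing the sleep function in as an argument is only there to make the helper easy to try out without actually waiting.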