Using Python to download specific .pdb files from Protein Data Bank

Question

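I have been trying to download .pdb files from the Protein Data Bank. I wrote the following code block to fetch the files, but the files I end up with contain a web page rather than the structure data.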

#Sector C - Processing block:
RefinedPDBCodeList = [] #C1
with open('RefinedPDBCodeList') as inputfile:
    for line in inputfile:
        RefinedPDBCodeList.append(line.strip().split(','))

print(RefinedPDBCodeList[0])
['101m.pdb']

import urllib.request
for i in range(0, 1): #S2 - range(0, len(RefinedPDBCodeList)):
    path = urllib.request.urlretrieve('http://www.rcsb.org/pdb/explore/explore.do?structureId=101m', '101m.pdb')

Answer 1

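It looks like your base URL is wrong: the explore.do address points to the entry's web page, not to the file itself. Try this instead: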

urllib.request.urlretrieve('http://files.rcsb.org/download/101M.pdb', '101m.pdb')
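To extend this to the whole list from the question, something like the following might work. It is only a sketch: build_pdb_url and download_all are hypothetical helper names, the HTTPS form of the base URL is assumed, and the entries are assumed to look like '101m.pdb' or '101m'. The code is uppercased only to match the URL shown above.

```python
import urllib.request


def build_pdb_url(entry, base_url="https://files.rcsb.org/download/"):
    """Turn a list entry such as '101m.pdb' or '101m' into a download URL."""
    code = entry.strip()
    # Drop a trailing '.pdb' so that only the bare PDB ID remains.
    if code.lower().endswith(".pdb"):
        code = code[:-4]
    return base_url + code.upper() + ".pdb"


def download_all(entries, fetch=urllib.request.urlretrieve):
    """Download every entry and return the local file names that were written.

    `fetch` is any callable taking (url, filename); it defaults to
    urllib.request.urlretrieve and can be swapped out for testing.
    """
    saved = []
    for entry in entries:
        url = build_pdb_url(entry)
        # Save under a lowercase name such as '101m.pdb', as in the question.
        filename = url.rsplit("/", 1)[-1].lower()
        fetch(url, filename)
        saved.append(filename)
    return saved
```

Because each element of RefinedPDBCodeList in the question is itself a one-item list such as ['101m.pdb'], the call would be along the lines of download_all(row[0] for row in RefinedPDBCodeList).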

Answer 2

The URL has been updated (although the old URL redirects to the new one for the time being):

urllib.request.urlretrieve('https://files.rcsb.org/download/101M.pdb', '101m.pdb')

For a complete list of the URLs for the different downloads available from RCSB PDB, see https://www.rcsb.org/pdb/static.do?p=download/http/index.html.
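Since the original symptom was a saved file that contained a web page, a quick check after downloading can catch that case early. The helper below is a rough heuristic, not part of any library: looks_like_html is a hypothetical name, and it simply assumes that a PDB text file begins with a record name made of letters, whereas an HTML page begins with '<'.

```python
def looks_like_html(path, probe_bytes=512):
    """Return True if the file at `path` appears to start with HTML markup.

    Heuristic only: reads the first `probe_bytes` bytes, skips leading
    whitespace, and checks whether the content begins with '<'.
    """
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes).lstrip()
    return head.startswith(b"<")
```

For example, looks_like_html('101m.pdb') returning True would suggest that the URL pointed at a page rather than at the structure file.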

Answer 3

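BioPython provides a retrieval method, PDBList.retrieve_pdb_file. However, it relies on the PDB FTP service. If the FTP port is not open for some reason (a firewall, for example), you can use this function instead: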

import os
import sys
import urllib.request

def download_pdb(pdbcode, datadir, downloadurl="https://files.rcsb.org/download/"):
    """
    Downloads a PDB file from the Internet and saves it in a data directory.
    :param pdbcode: The standard PDB ID e.g. '3ICB' or '3icb'
    :param datadir: The directory where the downloaded file will be saved
    :param downloadurl: The base PDB download URL, cf.
        `https://www.rcsb.org/pages/download/http#structures` for details
    :return: the full path to the downloaded PDB file or None if something went wrong
    """
    pdbfn = pdbcode + ".pdb"
    url = downloadurl + pdbfn
    outfnm = os.path.join(datadir, pdbfn)
    try:
        urllib.request.urlretrieve(url, outfnm)
        return outfnm
    except Exception as err:
        # Report the failure (e.g. an HTTP error or unwritable directory) on stderr
        print(str(err), file=sys.stderr)
        return None
