How to download PubMed articles and read them?

Question

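I am having trouble saving and reading PubMed articles. I saw on this page that there are some special file types, but none of them worked for me. I want to save the articles in a way that lets me keep using keys to get at the data. I don't know whether that is possible if I save them as a text file. This is my code: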

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

'''Class DownloadArticles is responsible for browsing the biological databases
from DownloadArticles import DownloadArticles
c = DownloadArticles()
c.articles_dataset_list
'''
class DownloadArticles():
    def __init__(self):
        Entrez.email='myemail@gmail.com'
        self.dataC = self.saveArticlesFilesInXMLMode('pubmed', '26837606')

    '''Method 4: read the data as text.'''
    def saveArticlesFilesInXMLMode(self,dbs, ids):
        net_handle = Entrez.efetch(db=dbs, id=ids, rettype="medline", retmode="txt")
        directory = "/dataset/Pubmed/DatasetArticles/"+ ids + ".fasta"
        # if not os.path.exists(directory):
        # os.makedirs(directory)
        # filename = directory + '/'
        # if not os.path.exists(filename):
        out_handle = open(directory, "w+")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved")
        print("Parsing...")
        record = SeqIO.read(directory, "fasta")
        print(record)
        return(record.read())

I get this error: ValueError: No records found in handle. Can anyone help me?

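My code now looks like this. I am trying to write one function that saves to a .fasta file, as you did, and another that reads the .fasta file, as in the answer above.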

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

def save_Articles_Files(dbName, idNum, rettypeName):
    net_handle = Entrez.efetch(db=dbName, id=idNum, rettype=rettypeName, retmode="txt")
    filename = path  + idNum + ".fasta"
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")

Entrez.email='myemail@gmail.com'
dbName = 'pubmed'
idNum = '26837606'
rettypeName = "medline"
path ="/run/media/Dropbox/codigos/Codes/"+dbName
save_Articles_Files(dbName, idNum, rettypeName)

But my function does not work, and I need some help!

Answer 1

You are mixing up two concepts.

1) Entrez.efetch() is used to access NCBI. In your case, you are downloading an article from PubMed. The result you get from net_handle.read() looks like this:

PMID- 26837606
OWN - NLM
STAT- In-Process
DA  - 20160203
LR  - 20160210
IS  - 2045-2322 (Electronic)
IS  - 2045-2322 (Linking)
VI  - 6
DP  - 2016 Feb 03
TI  - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
PG  - 20315
LID - 10.1038/srep20315 [doi]
AB  - Recently, CRISPR/Cas9 technology has emerged as a powerful approach for targeted 
      genome modification in eukaryotic organisms from yeast to human cell lines. Its
      successful application in several plant species promises enormous potential for
      basic and applied plant research. However, extensive studies are still needed to 
      assess this system in other important plant species, to broaden its fields of
      application and to improve methods. Here we showed that the CRISPR/Cas9 system is
      efficient in petunia (Petunia hybrid), an important ornamental plant and a model 
      for comparative research. When PDS was used as target gene, transgenic shoot
      lines with albino phenotype accounted for 55.6%-87.5% of the total regenerated T0
      Basta-resistant lines. A homozygous deletion close to 1 kb in length can be
      readily generated and identified in the first generation. A sequential
      transformation strategy--introducing Cas9 and sgRNA expression cassettes
      sequentially into petunia--can be used to make targeted mutations with short
      indels or chromosomal fragment deletions. Our results present a new plant species
      amenable to CRIPR/Cas9 technology and provide an alternative procedure for its
      exploitation.
FAU - Zhang, Bin
AU  - Zhang B
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Yang, Xia
AU  - Yang X
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Yang, Chunping
AU  - Yang C
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Li, Mingyang
AU  - Li M
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Guo, Yulong
AU  - Guo Y
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
LA  - eng
PT  - Journal Article
PT  - Research Support, Non-U.S. Gov't
DEP - 20160203
PL  - England
TA  - Sci Rep
JT  - Scientific reports
JID - 101563288
SB  - IM
PMC - PMC4738242
OID - NLM: PMC4738242
EDAT- 2016/02/04 06:00
MHDA- 2016/02/04 06:00
CRDT- 2016/02/04 06:00
PHST- 2015/09/21 [received]
PHST- 2015/12/30 [accepted]
AID - srep20315 [pii]
AID - 10.1038/srep20315 [doi]
PST - epublish
SO  - Sci Rep. 2016 Feb 3;6:20315. doi: 10.1038/srep20315.

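2) SeqIO.read() is used to read and parse FASTA files. FASTA is a format for storing sequences. A sequence in FASTA format is represented as a series of lines. The first line of a FASTA file starts with a ">" (greater-than) symbol and holds a unique description of the sequence. The lines after it contain the actual sequence itself in standard one-letter code.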
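To make the difference between the two formats concrete, here is a small sketch using only the standard library. The helper name looks_like_fasta and the sample sequence are made up for illustration. The check is only a rough heuristic based on the leading ">" and is not how SeqIO itself parses a file.

```python
def looks_like_fasta(text):
    """Return True if the first non-blank line starts with '>'.

    Hypothetical helper for illustration only: a rough heuristic,
    not the parsing logic Biopython uses internally.
    """
    for line in text.splitlines():
        if line.strip():
            return line.startswith(">")
    return False  # empty input cannot be a FASTA record


# A made-up FASTA record: one description line, then sequence lines.
fasta_text = ">seq1 hypothetical example sequence\nATGCGTACGTTAGC\nGGCTA\n"

# The first lines of the MEDLINE text returned for the PubMed article.
medline_text = "PMID- 26837606\nOWN - NLM\n"

print(looks_like_fasta(fasta_text))    # True
print(looks_like_fasta(medline_text))  # False
```

The MEDLINE text starts with "PMID-" rather than ">", so it cannot be read as a sequence record no matter what extension the file has.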

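As you can see, the result returned by Entrez.efetch() (pasted above) does not look like a FASTA file. That is why SeqIO.read() raises an error saying that it could not find any sequence records in the file.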
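One possible way forward, since you already import Bio.Medline, is to save the text under a plain .txt name and parse it with Medline.read(), which returns a dictionary-like record that can be queried by the MEDLINE keys (PMID, TI, AU, AB and so on). The sketch below is only a suggestion. The function names download_medline and read_medline and the file name are assumptions, and it assumes a Biopython version in which Entrez.efetch() accepts retmode="text".

```python
from Bio import Entrez, Medline


def download_medline(pmid, path, email):
    """Fetch one PubMed record as MEDLINE text and save it to `path`.

    Hypothetical helper: the name and arguments are illustrative.
    """
    Entrez.email = email  # NCBI asks for a contact address
    net_handle = Entrez.efetch(db="pubmed", id=pmid,
                               rettype="medline", retmode="text")
    try:
        text = net_handle.read()
    finally:
        net_handle.close()
    # Some Biopython versions may hand back bytes; decode if so.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    with open(path, "w") as out_handle:
        out_handle.write(text)


def read_medline(path):
    """Parse a saved MEDLINE file into a dictionary-like record."""
    with open(path) as handle:
        return Medline.read(handle)
```

Calling download_medline("26837606", "26837606.txt", "myemail@gmail.com") and then record = read_medline("26837606.txt") should allow lookups such as record["TI"] for the title or record["AU"] for the list of authors.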
