Extracting PubMed data in xml format from txt batches in Python
I asked about this before and it was solved perfectly. Below is the code that works perfectly across multiple traditional xml files.
import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

l = list()
for f in glob('*.xml'):  # Changed to .txt here
    pub = dict()
    with open(f, 'r') as xml_file:
        xml = xml_file.read()
    soup = BeautifulSoup(xml, "lxml")
    pub['PMID'] = soup.find('pmid').text
    pub_list = soup.find('publicationtypelist')
    pub['Publication_type'] = list()
    for pub_type in pub_list.find_all('publicationtype'):
        pub['Publication_type'].append(pub_type.text)
    try:
        pub['NCTID'] = soup.find('accessionnumber').text
    except:
        pub['NCTID'] = None
    l.append(pub)

df = pd.DataFrame(l)
df = df.explode('Publication_type', ignore_index=True)
It gives me the desired output:
       PMID Publication_type        NCTID
0  34963793  Journal Article  NCT02649218
1  34963793           Review  NCT02649218
2  34535952  Journal Article         None
3  34090787  Journal Article  NCT02424799
4  33615122  Journal Article  NCT01922037
The only thing I have changed since then: I extracted the data using R and the easyPubMed package. The data were extracted in batches (100 articles per batch) and stored in xml format inside txt documents. I have 150 txt documents in total. The code now extracts only ~150 rows (one per txt file) instead of the expected ~15,000 (one per article). How do I update the Python code above to get the same output now that the input files have changed? I added several txt files here for reproducibility. I need to extract PMID, Publication_type, and NCTID.
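A quick way to confirm that each batch file really holds many articles (a minimal sketch; the '*.txt' pattern assumes the batch files sit in the working directory):

# Hypothetical sanity check: count <PubmedArticle> opening tags per batch file
from glob import glob

for f in sorted(glob('*.txt'))[:3]:      # inspect the first few batches
    with open(f, 'r') as fh:
        n = fh.read().count('<PubmedArticle>')
    print(f, n)                          # expect ~100 per file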
The previous code builds a data frame for the XML of a single article, not hundreds of articles. You therefore need to capture the selected nodes under every <PubmedArticle> instance in the XML. Currently only the first article of each XML is captured.
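A minimal illustration of the issue (hypothetical two-article document): BeautifulSoup's find() returns only the first match, which is why you get one row per file:

# Hypothetical two-article document: find() stops at the first match
from bs4 import BeautifulSoup

xml = ("<PubmedArticleSet>"
       "<PubmedArticle><PMID>111</PMID></PubmedArticle>"
       "<PubmedArticle><PMID>222</PMID></PubmedArticle>"
       "</PubmedArticleSet>")
soup = BeautifulSoup(xml, "lxml")

print(soup.find('pmid').text)                    # 111  (first article only)
print([p.text for p in soup.find_all('pmid')])   # ['111', '222']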
Consider an etree iterparse solution, which is less memory-intensive for reading large XML files and lets you extract the required nodes between the opening and closing of each <PubmedArticle> node:
import pandas as pd
import xml.etree.ElementTree as ET
from glob import glob

data = []                                          # INITIALIZE DATA LIST

for xml_file in glob('*.txt'):
    for event, elem in ET.iterparse(xml_file, events=('start', 'end')):
        if event == 'start':
            if elem.tag == "PubmedArticle":
                pub = {}                           # INITIALIZE ARTICLE DICT
            if elem.tag == 'PMID':
                pub["PMID"] = elem.text
                pub["PublicationType"] = []
                pub["NCTID"] = None
            elif elem.tag == 'PublicationType':
                pub["PublicationType"].append(elem.text)
            elif elem.tag == 'AccessionNumber':
                pub["NCTID"] = elem.text

        if event == 'end':
            if elem.tag == "PubmedArticle":
                pub["Source"] = xml_file
                data.append(pub)                   # APPEND MULTIPLE ARTICLES
                elem.clear()                       # FREE PARSED ELEMENT MEMORY

# BUILD XML DATA FRAME
final_df = (
    pd.DataFrame(data)
      .explode('PublicationType', ignore_index=True)
)
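As a quick follow-up check (a sketch; the expected counts assume the 150 batches of 100 articles each described in the question):

# Hypothetical sanity checks on the assembled data frame
print(final_df.shape)                  # rows = one per (article, publication type)
print(final_df['PMID'].nunique())      # expect ~15,000 distinct articles
print(final_df.head())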