Parsing PubMed data and extracting multiple columns from multiple files
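I have multiple XML files from PubMed. A few of the files are here.
How can I parse them and get these columns into a single data frame?
If an article has multiple authors, I want each author on a separate row.
Expected output (should include all authors):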
Title Year ArticleTitle LastName ForeName
Nature 2021 Inter-mosaic ... Roy Suva
Nature 2021 Inter-mosaic ... Pearson John
Nature 2021 Neural dynamics Pearson John
Nature 2021 Neural dynamics Mooney Richard
First, what you want is doable. Something like this should work for your second file, and you can add the other files by wrapping the code in a for loop:
from lxml import etree
import pandas as pd

doc = etree.parse('file.xml')
columns = ['Title','ArticleDate','ArticleTitle','LastName','ForeName']
# Article-level fields: identical for every author of this article
title = doc.xpath(f'//{columns[0]}/text()')[0]
year = doc.xpath(f'//{columns[1]}//Year/text()')[0]
article_title = doc.xpath(f'//{columns[2]}/text()')[0]
rows = []
# One row per author; the loop body must be indented
for auth in doc.xpath('//Author'):
    last_name = auth.xpath(f'{columns[3]}/text()')[0]
    fore_name = auth.xpath(f'{columns[4]}/text()')[0]
    rows.append([title,year,article_title,last_name,fore_name])
pd.DataFrame(rows,columns=columns)
Output (34671166.xml):
Title ArticleDate ArticleTitle LastName ForeName
0 Nature 2021 Neural dynamics underlying birdsong practice a... Singh Alvarado Jonnathan
1 Nature 2021 Neural dynamics underlying birdsong practice a... Goffinet Jack
2 Nature 2021 Neural dynamics underlying birdsong practice a... Michael Valerie
3 Nature 2021 Neural dynamics underlying birdsong practice a... Liberti William
4 Nature 2021 Neural dynamics underlying birdsong practice a... Hatfield Jordan
5 Nature 2021 Neural dynamics underlying birdsong practice a... Gardner Timothy
6 Nature 2021 Neural dynamics underlying birdsong practice a... Pearson John
7 Nature 2021 Neural dynamics underlying birdsong practice a... Mooney Richard
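The for loop mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: that each file wraps its records in `PubmedArticle` elements, that the year should come from `ArticleDate/Year`, and that some authors may lack a `LastName` or `ForeName` (in which case the cell is left empty instead of raising an `IndexError`). The names `first_text`, `article_rows` and `build_frame` are made up for this example.

```python
from pathlib import Path

from lxml import etree
import pandas as pd

COLUMNS = ['Title', 'Year', 'ArticleTitle', 'LastName', 'ForeName']


def first_text(node, path):
    """Return the first text match for an XPath, or None if nothing matches."""
    hits = node.xpath(path)
    return hits[0] if hits else None


def article_rows(source):
    """Yield one row per author for every PubmedArticle found in one file."""
    doc = etree.parse(source)
    # Assumption: records are wrapped in <PubmedArticle> elements.
    for art in doc.xpath('//PubmedArticle'):
        title = first_text(art, './/Journal/Title/text()')
        year = first_text(art, './/ArticleDate/Year/text()')
        article_title = first_text(art, './/ArticleTitle/text()')
        for auth in art.xpath('.//AuthorList/Author'):
            yield [title, year, article_title,
                   first_text(auth, 'LastName/text()'),
                   first_text(auth, 'ForeName/text()')]


def build_frame(sources):
    """Collect the rows of all files into a single data frame."""
    rows = [row for src in sources for row in article_rows(src)]
    return pd.DataFrame(rows, columns=COLUMNS)


# Hypothetical usage: every .xml file in the current directory.
# df = build_frame(str(p) for p in sorted(Path('.').glob('*.xml')))
```

Scoping the XPath queries to each article (`art.xpath('.//...')`) rather than to the whole document keeps authors attached to the right article if a file happens to contain more than one record.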
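That said, I'm not sure a data frame with each author on a separate row is the right fit for the kind of data you have. In this example, because there are 8 co-authors, information such as the article title is needlessly repeated 8 times. You could give each author a separate set of columns instead, but if an article has 3 or 10 co-authors... you'll run into problems.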
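One possible middle ground, which is not part of the answer above and is offered only as a sketch, is to keep one row per article with all authors stored as a list in a single cell, and to expand to one row per author only when needed using pandas' `DataFrame.explode`. The data below is illustrative, taken from the expected output in the question, and the `Authors` column name is an assumption.

```python
import pandas as pd

# One row per article; authors are kept as a list of (LastName, ForeName).
articles = pd.DataFrame({
    'Title': ['Nature', 'Nature'],
    'Year': ['2021', '2021'],
    'ArticleTitle': ['Inter-mosaic ...', 'Neural dynamics'],
    'Authors': [
        [('Roy', 'Suva'), ('Pearson', 'John')],
        [('Pearson', 'John'), ('Mooney', 'Richard')],
    ],
})

# Expand to one row per author only when that layout is actually needed.
long = articles.explode('Authors').reset_index(drop=True)
long['LastName'] = [author[0] for author in long['Authors']]
long['ForeName'] = [author[1] for author in long['Authors']]
long = long.drop(columns='Authors')

print(long)
```

This avoids both the repeated article-level values and the problem of a variable number of per-author columns, at the cost of holding Python lists inside a column.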