PubMed: fetch article details into a DataFrame
Here is the code.
import pandas as pd
from pymed import PubMed
import numpy as np
pubmed = PubMed(tool="PubMedSearcher", email="myemail@ccc.com")
## PUT YOUR SEARCH TERM HERE ##
search_term = 'Charlie Brown'
results = pubmed.query(search_term, max_results=100000)
articleList = []
articleInfo = []
for article in results:
    # Each result is either a PubMedBookArticle or a PubMedArticle.
    # Convert it to a dictionary with the available toDict() method.
    articleDict = article.toDict()
    articleList.append(articleDict)
# Build a list of dict records holding the article details fetched from the PubMed API
for article in articleList:
    # Sometimes article['pubmed_id'] holds several newline-separated IDs; the first one is the article's own PubMed ID
    pubmedId = article['pubmed_id'].partition('\n')[0]
    # Append the article info to the list of records
    articleInfo.append({u'pubmed_id':pubmedId,
                        u'publication_date':article['publication_date'],
                        u'authors':article['authors']})
df=pd.json_normalize(articleInfo)
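Running this code produces three columns: pubmed_id, publication_date, and authors.
Is there a way to unnest the authors column while keeping the other two columns? Thanks a lot.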
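If you want to unnest the column, you first have to decide on a strategy. For example, you can format each author as lastname, firstname and join the authors with ; as the separator: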
# New column to easily identify how many authors there are in the paper
df['n_authors'] = df['authors'].map(len)
# Unnest authors into a single string using the above-mentioned strategy
df['authors'] = df['authors'].map(lambda authors: ';'.join([f"{author['lastname']}, {author['firstname']}" for author in authors]))
Output:
    pubmed_id  publication_date  authors                                            n_authors
 0   35435469        2022-04-19  Easwaran, Raju;Khan, Moin;Sancheti, Parag;Shya...         41
 1   34480858        2021-09-05  Flaxman, Amy;Marchevsky, Natalie G;Jenkin, Dan...         38
 2   30857579        2019-03-13  Brown, Charlie                                             1
 3   28640023        2017-06-24  Thornton, Kevin C;Schwarz, Jennifer J;Gross, A...         12
 4   24195874        2013-11-08  Bicket, Mark C;Gupta, Anita;Brown, Charlie H;C...          4
 5   21741796        2011-07-12  Bird, Jonathan H;Carmont, Michael R;Dhillon, M...          7
 6   21324873        2011-02-18  Cohen, Steven P;Brown, Charlie;Kurihara, Conni...          6
 7   20228712        2010-03-17  Cohen, Steven P;Kapoor, Shruti G;Nguyen, Cuong...          8
 8   20109957        2010-01-30  Cohen, Steven P;Brown, Charlie;Kurihara, Conni...          6
 9   18248779        2008-02-06  Whitaker, Iain S;Duggan, Eileen M;Alloway, Rit...         10
10   16917639        2006-08-19  Drayton, William;Brown, Charlie;Hillhouse, Karin           3
11   16282488        2005-11-12  Mao, Hanwen;Lafont, Bernard A P;Igarashi, Tats...          9
12   14581571        2003-10-29  Moniuszko, Marcin;Brown, Charlie;Pal, Ranajit;...          7
13   12163382        2002-08-07  Williams, Kenneth;Schwartz, Annette;Corey, Sar...         10
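A different strategy, if you would rather have one row per author than one joined string, is to let pd.json_normalize expand the nested list itself. The sketch below is only an illustration: article_info is hypothetical sample data shaped like the articleInfo list from the question, explode_authors is a made-up helper name, and it assumes each author dict has at least the lastname and firstname keys used above.

```python
import pandas as pd

# Hypothetical sample records shaped like the articleInfo list in the question.
# Real pymed author dicts may carry more keys than these two (assumption).
article_info = [
    {"pubmed_id": "30857579", "publication_date": "2019-03-13",
     "authors": [{"lastname": "Brown", "firstname": "Charlie"}]},
    {"pubmed_id": "16917639", "publication_date": "2006-08-19",
     "authors": [{"lastname": "Drayton", "firstname": "William"},
                 {"lastname": "Brown", "firstname": "Charlie"},
                 {"lastname": "Hillhouse", "firstname": "Karin"}]},
]


def explode_authors(records):
    """Return one row per (article, author) pair, keeping the article columns."""
    # record_path turns every dict in the 'authors' list into its own row;
    # meta repeats the article-level fields on each of those rows.
    return pd.json_normalize(
        records,
        record_path="authors",
        meta=["pubmed_id", "publication_date"],
    )


long_df = explode_authors(article_info)
print(long_df)
```

With this layout you can filter or group by author directly, for example long_df[long_df['lastname'] == 'Brown']. Note that an article whose authors list is empty contributes no rows here, so the joined-string approach is the safer choice if every article has to stay in the result.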