Python: How do I save the result of scholarly.search_pubs() as a dataframe?
I found an article with the scholarly.search_pubs() function using the following code:
search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
scholarly.pprint(next(search_query))
Output:
{'author_id': ['', ''],
'bib': {'abstract': 'A style goods item has a finite selling period during '
'which the sales rate varies in a seasonal and, to some '
'extent, predictable fashion. There are only a limited '
'number of opportunities to purchase or manufacture the '
'style goods item, and the cost, in general, will depend '
'on the time at which the item is obtained. The unit '
'revenue achieved from sales of the item also varies '
'during the selling season, and, in particular, reaches '
'an appreciably lower terminal salvage value. Previous '
'work on this class of problem has assumed one of the '
'following:(a)',
'author': ['GR Murray Jr', 'EA Silver'],
'pub_year': '1966',
'title': 'A Bayesian analysis of the style goods inventory problem',
'venue': 'Management Science'},
'citedby_url': '/scholar?cites=9014559854426428787&as_sdt=5,33&sciodt=0,33&hl=en',
'filled': False,
'gsrank': 1,
'num_citations': 208,
'pub_url': 'https://pubsonline.informs.org/doi/abs/10.1287/mnsc.12.11.785',
'source': 'PUBLICATION_SEARCH_SNIPPET',
'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3DA%2BBayesian%2BAnalysis%2Bof%2Bthe%2BStyle%2BGoods%2BInventory%2BProblem%26hl%3Den%26as_sdt%3D0,33&citilm=1&update_op=library_add&info=c5WVKW0mGn0J&ei=4DdoYri8IoySyASZk6HgCA&json=',
'url_related_articles': '/scholar?q=related:c5WVKW0mGn0J:scholar.google.com/&scioq=A+Bayesian+Analysis+of+the+Style+Goods+Inventory+Problem&hl=en&as_sdt=0,33',
'url_scholarbib': '/scholar?q=info:c5WVKW0mGn0J:scholar.google.com/&output=cite&scirp=0&hl=en'}
I want to save this output as a pandas dataframe. Can someone help me with this?
Edit (1):
Thank you for answering my question.
When I run this code:
data = next(search_query)
df = pd.json_normalize(data)
...it gives the following error message:
StopIteration Traceback (most recent call last)
<ipython-input-78-ef73437b55a5> in <module>
----> 1 data = next(search_query)
2 df = pd.json_normalize(data)
~\Anaconda3\lib\site-packages\scholarly\publication_parser.py in __next__(self)
91 return self.__next__()
92 else:
---> 93 raise StopIteration
94
95 # Pickle protocol
StopIteration:
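Follow-up question
I have an Excel file that contains the titles of multiple articles. Instead of searching for each article individually, I imported the Excel file as a dataframe and used the following code to look up information about the articles: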
for i in df['Title']:
    search_query_1 = scholarly.search_pubs(i)
Now the search_query_1 iterator contains multiple articles. How can I save them as a dataframe?
Try using pd.json_normalize
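# python 3.8.9
# scholarly==1.6.0
import pandas as pd
from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
data = next(search_query)  # first result only
# use data = list(search_query) instead to get every result of the search
df = pd.json_normalize(data)  # nested 'bib' keys become 'bib.title', 'bib.author', ...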
#output
>>> df.T
0
container_type Publication
source PublicationSource.PUBLICATION_SEARCH_SNIPPET
filled False
gsrank 1
pub_url https://pubsonline.informs.org/doi/abs/10.1287...
author_id [, ]
url_scholarbib /scholar?q=info:c5WVKW0mGn0J:scholar.google.co...
url_add_sclib /citations?hl=en&xsrf=&continue=/scholar%3Fq%3...
num_citations 209
citedby_url /scholar?cites=9014559854426428787&as_sdt=5,33...
url_related_articles /scholar?q=related:c5WVKW0mGn0J:scholar.google...
bib.title A Bayesian analysis of the style goods invento...
bib.author [GR Murray Jr, EA Silver]
bib.pub_year 1966
bib.venue Management Science
bib.abstract A style goods item has a finite selling period...
>>> df.columns
Index(['container_type', 'source', 'filled', 'gsrank', 'pub_url', 'author_id',
'url_scholarbib', 'url_add_sclib', 'num_citations', 'citedby_url',
'url_related_articles', 'bib.title', 'bib.author', 'bib.pub_year',
'bib.venue', 'bib.abstract'],
dtype='object')
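A note on the StopIteration from the question's edit: the most likely cause is that search_query had already been advanced by the earlier scholarly.pprint(next(search_query)) call and the search only had one result, so the second next() found nothing left. The sketch below reproduces that behaviour with a plain Python iterator standing in for the object returned by scholarly.search_pubs(); the sample dict is made up for illustration and no network call is made.

```python
# Stand-in for the iterator returned by scholarly.search_pubs(); any one-shot
# iterator behaves the same way. The dict below is hypothetical sample data.
search_query = iter(
    [{"bib": {"title": "A Bayesian analysis of the style goods inventory problem"}}]
)

first = next(search_query)         # consumes the only result
second = next(search_query, None)  # the default avoids raising StopIteration

print(first["bib"]["title"])
print(second)  # None, because the iterator is exhausted
```

In practice this means either keeping the first result in a variable and reusing it, or calling scholarly.search_pubs() again to get a fresh iterator before calling next().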
Collecting iterated searches and JSON-normalizing them
Handling iteration over multiple titles
titles_to_search = list(df['Title'].unique())
dfs = []
for title_to_search in titles_to_search:
    search_query = scholarly.search_pubs(title_to_search)
    search_results = list(search_query)  # exhaust the iterator for this title
    temp_df = pd.json_normalize(data=search_results)
    if not temp_df.empty:  # skip titles with no results
        dfs.append(temp_df)
total_search_df = pd.concat(dfs)
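Two small refinements may be worth considering for the loop above: recording which searched title each row came from, and guarding against the case where every search comes back empty (pd.concat raises a ValueError on an empty list). The sketch below shows both on hand-written data so it runs offline; fake_results and the query_title column are names made up for this example, and in real use each list would come from list(scholarly.search_pubs(title)).

```python
import pandas as pd

# Hypothetical, hand-written results keyed by the searched title.
fake_results = {
    "Title A": [
        {"gsrank": 1, "num_citations": 10,
         "bib": {"title": "Title A", "pub_year": "1966"}},
    ],
    "Title B": [],  # a search that returned nothing
    "Title C": [
        {"gsrank": 1, "num_citations": 3,
         "bib": {"title": "Title C", "pub_year": "2001"}},
        {"gsrank": 2, "num_citations": 0,
         "bib": {"title": "Title C revisited", "pub_year": "2005"}},
    ],
}

frames = []
for title, results in fake_results.items():
    if not results:
        continue  # skip empty searches
    frame = pd.json_normalize(results)
    frame["query_title"] = title  # remember which search produced each row
    frames.append(frame)

# Fall back to an empty frame if no search returned anything.
total_search_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(total_search_df[["query_title", "bib.title", "gsrank"]])
```

Passing ignore_index=True gives the combined frame a continuous 0..n-1 index instead of repeating each per-title index.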