Python: How do I save scholarly.search_pubs() result as a dataframe?

Question

I used the scholarly.search_pubs() function to find an article with the code below:

from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
scholarly.pprint(next(search_query))

Output:

{'author_id': ['', ''],
 'bib': {'abstract': 'A style goods item has a finite selling period during '
                     'which the sales rate varies in a seasonal and, to some '
                     'extent, predictable fashion. There are only a limited '
                     'number of opportunities to purchase or manufacture the '
                     'style goods item, and the cost, in general, will depend '
                     'on the time at which the item is obtained. The unit '
                     'revenue achieved from sales of the item also varies '
                     'during the selling season, and, in particular, reaches '
                     'an appreciably lower terminal salvage value. Previous '
                     'work on this class of problem has assumed one of the '
                     'following:(a)',
         'author': ['GR Murray Jr', 'EA Silver'],
         'pub_year': '1966',
         'title': 'A Bayesian analysis of the style goods inventory problem',
         'venue': 'Management Science'},
 'citedby_url': '/scholar?cites=9014559854426428787&as_sdt=5,33&sciodt=0,33&hl=en',
 'filled': False,
 'gsrank': 1,
 'num_citations': 208,
 'pub_url': 'https://pubsonline.informs.org/doi/abs/10.1287/mnsc.12.11.785',
 'source': 'PUBLICATION_SEARCH_SNIPPET',
 'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3DA%2BBayesian%2BAnalysis%2Bof%2Bthe%2BStyle%2BGoods%2BInventory%2BProblem%26hl%3Den%26as_sdt%3D0,33&citilm=1&update_op=library_add&info=c5WVKW0mGn0J&ei=4DdoYri8IoySyASZk6HgCA&json=',
 'url_related_articles': '/scholar?q=related:c5WVKW0mGn0J:scholar.google.com/&scioq=A+Bayesian+Analysis+of+the+Style+Goods+Inventory+Problem&hl=en&as_sdt=0,33',
 'url_scholarbib': '/scholar?q=info:c5WVKW0mGn0J:scholar.google.com/&output=cite&scirp=0&hl=en'}

I want to save this output as a pandas DataFrame. Can someone help me?

Edit (1): Thank you for answering my question.

When I run this code:

data = next(search_query)
df = pd.json_normalize(data)

...it gives the following error message:

StopIteration                             Traceback (most recent call last)
<ipython-input-78-ef73437b55a5> in <module>
----> 1 data = next(search_query)
      2 df = pd.json_normalize(data)

~\Anaconda3\lib\site-packages\scholarly\publication_parser.py in __next__(self)
     91             return self.__next__()
     92         else:
---> 93             raise StopIteration
     94 
     95     # Pickle protocol
StopIteration:

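Follow-up question

I have an Excel file that contains the titles of several articles. Rather than searching for each article individually, I imported the Excel file as a dataframe and used the following code to look up information about the articles: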

for i in df['Title']:
    search_query_1 = scholarly.search_pubs(i)

Now the search_query_1 iterator contains multiple articles. How can I save them as a dataframe?

Answer 1

Try using pd.json_normalize

# python 3.8.9
# scholarly==1.6.0
import pandas as pd
from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
data = next(search_query)
# you can use data = list(search_query) to get the entire search back
df = pd.json_normalize(data)

#output
>>> df.T                                                                      
                                                                     0
container_type                                              Publication
source                     PublicationSource.PUBLICATION_SEARCH_SNIPPET
filled                                                            False
gsrank                                                                1
pub_url               https://pubsonline.informs.org/doi/abs/10.1287...
author_id                                                          [, ]
url_scholarbib        /scholar?q=info:c5WVKW0mGn0J:scholar.google.co...
url_add_sclib         /citations?hl=en&xsrf=&continue=/scholar%3Fq%3...
num_citations                                                       209
citedby_url           /scholar?cites=9014559854426428787&as_sdt=5,33...
url_related_articles  /scholar?q=related:c5WVKW0mGn0J:scholar.google...
bib.title             A Bayesian analysis of the style goods invento...
bib.author                                    [GR Murray Jr, EA Silver]
bib.pub_year                                                       1966
bib.venue                                            Management Science
bib.abstract          A style goods item has a finite selling period...
>>> df.columns
Index(['container_type', 'source', 'filled', 'gsrank', 'pub_url', 'author_id',
       'url_scholarbib', 'url_add_sclib', 'num_citations', 'citedby_url',
       'url_related_articles', 'bib.title', 'bib.author', 'bib.pub_year',
       'bib.venue', 'bib.abstract'],
      dtype='object')
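
On the StopIteration from the question's edit: search_pubs() returns an iterator, and each next() call consumes one result. If the search produced a single hit and scholarly.pprint(next(search_query)) had already been run on the same search_query, a second next() has nothing left to return, which is likely what happened there. A minimal sketch of that behaviour, using a plain Python iterator as a stand-in (fake_results and its contents are made up for illustration, and no request is sent to Google Scholar):

```python
# Stand-in for the iterator returned by scholarly.search_pubs();
# the single dict below is made-up illustration data.
fake_results = iter([{"bib": {"title": "Example title"}, "num_citations": 1}])

first = next(fake_results)          # consumes the only result
second = next(fake_results, None)   # a default value avoids StopIteration

print(first["bib"]["title"])
print(second)

# Calling next() without a default on the exhausted iterator raises.
try:
    next(fake_results)
except StopIteration:
    print("iterator exhausted: run the search again to get a fresh one")
```

Calling scholarly.search_pubs(...) again right before next() gives a fresh iterator, so the snippet at the top of this answer should be run as a whole.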

Collecting the results of each search and JSON-normalizing them

Iterating over multiple titles

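titles_to_search = list(df['Title'].unique())

dfs = []
for title_to_search in titles_to_search:
    # each search returns its own iterator, so collect its results before the next search
    search_query = scholarly.search_pubs(title_to_search)
    search_results = list(search_query)

    temp_df = pd.json_normalize(data=search_results)
    if not temp_df.empty:
        dfs.append(temp_df)

# pd.concat raises ValueError on an empty list, so guard against no results at all
total_search_df = pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()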
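
If only the best match per title is wanted, and you want to know which Excel title each row came from, one option is to take the first hit of each search and tag it with the query. A sketch under assumptions: first_hits and searched_title are names introduced here, and fake_search is a made-up stand-in so the block runs offline. In real use you would pass scholarly.search_pubs as the search argument.

```python
def first_hits(titles, search):
    """Return (records, missing): the first hit per title, and titles with no hit.

    `search` is any callable that takes a title and returns an iterator of
    dicts, for example scholarly.search_pubs.
    """
    records, missing = [], []
    for title in titles:
        hit = next(search(title), None)  # None when the search yields nothing
        if hit is None:
            missing.append(title)
        else:
            # Build a new dict so the original result is left untouched.
            records.append({"searched_title": title, **hit})
    return records, missing


# Made-up stand-in for scholarly.search_pubs, for illustration only.
def fake_search(title):
    known = {"Paper A": [{"bib": {"title": "Paper A"}, "num_citations": 3}]}
    return iter(known.get(title, []))


records, missing = first_hits(["Paper A", "Paper B"], fake_search)
print(records)
print(missing)
```

The records list can then be passed to pd.json_normalize(records), as in the code above, to get one row per matched title, while missing lists the titles that need a manual check.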

