Scraping and parsing citation info from Google Scholar search results
I have a list of around 20,000 article titles, and I want to scrape their citation counts from Google Scholar. I am new to the BeautifulSoup library. I have this code:
import requests
from bs4 import BeautifulSoup

query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']

url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
But it returns only the title and the URL. I don't know how to get the citation info from another tag. Please help me.
You need to loop over the list. You can use a Session for efficiency. The version below is for bs4 4.7.1+, which supports the :contains pseudo-class for finding the citation count. It looks like you can drop the h3 type selector from the CSS selector and just use the class before the a, i.e. .gs_rt a. If you don't have 4.7.1, you can use [title=Cite] + a instead to select the citation count.
import requests
from bs4 import BeautifulSoup as bs

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']

with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
        citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        print(title, link, citations)
An alternative for bs4 < 4.7.1:
with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('.gs_rt a')
        if title is None:
            title = 'No title'
            link = 'No link'
        else:
            link = title['href']
            title = title.text
        citations = soup.select_one('[title=Cite] + a')
        if citations is None:
            citations = 'No citation count'
        else:
            citations = citations.text
        print(title, link, citations)
The bottom version was re-written thanks to comments from @facelessuser. The top version is left for comparison:
It is probably more efficient not to call select_one twice in a single-line if statement. While the pattern build is cached, the returned tag is not cached. I personally would set the variable to the return of select_one, and then, only if the variable is None, change it to 'No link' or 'No title' etc. It isn't as compact, but it will be more efficient.

[...] always check if tag is None: and not just if tag:. With selectors it isn't a big deal as they will only return tags, but if you ever do something like for x in tag.descendants: you'll get text nodes (strings) and tags, and an empty string will evaluate as false even though it is a valid node. The safest thing to do in that case is to check for None.
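As a side note, both versions above build the URL by concatenating the raw title into the query string. Letting requests encode the query via params is a little more robust for titles containing punctuation. A minimal sketch of the same request, using one illustrative title:

import requests

query = 'Primary Prevention of CVD'  # one title from the list above
with requests.Session() as s:
    # requests percent-encodes the spaces and punctuation in the title for us
    r = s.get('https://scholar.google.com/scholar', params={'q': query, 'hl': 'en'})
    print(r.url)  # the encoded URL that was actually requested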
Instead of finding all the <h3> tags, I suggest you search for the tag that contains both the <h3> and the citation (inside <div class="gs_rs">), i.e. find all the <div class="gs_ri"> tags.

From those tags, you should be able to get everything you need:
import requests
from bs4 import BeautifulSoup

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']

results = []
for query in queries:  # one request per title; concatenating the whole list would raise a TypeError
    url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
    content = requests.get(url).text
    page = BeautifulSoup(content, 'lxml')
    for entry in page.find_all("div", attrs={"class": "gs_ri"}):  # tag containing both h3 and citation
        results.append({"title": entry.h3.a.text,
                        "url": entry.a['href'],
                        "citation": entry.find("div", attrs={"class": "gs_rs"}).text})
Make sure you're using a user-agent, because the default requests user-agent is python-requests, and Google might block your requests; you would then receive different HTML, with some sort of error, that doesn't contain the selectors you're trying to select. Check what your user-agent is.

It might also be a good idea to rotate user-agents while making requests.
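A minimal sketch of rotating user-agents, assuming a small hard-coded pool (the strings below are illustrative; in practice keep a larger, up-to-date list):

import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
]

headers = {'User-agent': random.choice(user_agents)}  # pick a fresh UA per request
html = requests.get('https://scholar.google.com/scholar',
                    params={'q': 'Primary Prevention of CVD', 'hl': 'en'},
                    headers=headers).text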
Code and a full example that scrapes much more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']

for query in queries:
    params = {
        "q": query,
        "hl": "en",
    }

    html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
    soup = BeautifulSoup(html, 'lxml')

    # Container where all needed data is located
    for result in soup.select('.gs_ri'):
        title = result.select_one('.gs_rt').text
        title_link = result.select_one('.gs_rt a')['href']
        cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']  # relative "Cited by" link
        cited_by_count = result.select_one('#gs_res_ccl_mid .gs_nph+ a').text.split(' ')[2]  # the N in "Cited by N"
        print(f"{title}\n{title_link}\n{cited_by}\n{cited_by_count}\n")
Alternatively, you can achieve the same thing with the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to iterate over structured JSON and grab the data you want, rather than figuring out why certain things don't work as expected.

Code to integrate:
from serpapi import GoogleSearch
import os, json

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']

for query in queries:
    params = {
        "api_key": os.getenv("API_KEY"),  # your SerpApi API key
        "engine": "google_scholar",
        "q": query,
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    data = []
    for result in results['organic_results']:
        data.append({
            'title': result['title'],
            'link': result['link'],
            'publication_info': result['publication_info']['summary'],
            'snippet': result['snippet'],
            'cited_by': result['inline_links']['cited_by']['link'],
            'related_versions': result['inline_links']['related_pages_link'],
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))
P.S. - I wrote a blog post about how to scrape pretty much everything on Google Scholar, with visual representations.
Disclaimer, I work for SerpApi.