Identifying issue in retrieving href from Google Scholar
I'm having trouble scraping the links and article names from Google Scholar. I'm not sure whether the problem is in my code or in the XPath expressions I'm using to retrieve the data, or possibly both.
I've spent the past few hours trying to debug and consulting other Stack Overflow questions, with no success.
import scrapy
from scrapyproj.items import ScrapyProjItem

class scholarScrape(scrapy.Spider):
    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls = ["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self, response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/@href").extract()
        item['name'] = item.xpath("//div[@class='gs_rt']/h3").extract()
        yield item
The error message I get says "AttributeError: xpath", so I think the problem is in the paths I'm using to try to retrieve the data, but could I be mistaken about that?
Adding my comment as an answer, since it solved the problem:
The issue is with your scrapyproj.items.ScrapyProjItem objects: they have no xpath attribute. Is that an official scrapy class? I think you meant to call xpath on response:
item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/@href").extract()
item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
Also, the first path expression needs an @ sign and a set of quotes around the attribute value "gs_rt":

item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()
Other than that, the XPath expressions are fine.
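To see why the @ sign and quotes matter, here is a minimal, self-contained check with lxml. The HTML fragment is a hypothetical snippet modeled on a Scholar result entry, not actual scraped output:

```python
from lxml import html

# Hypothetical fragment modeled on a Scholar search result (assumed markup).
doc = html.fromstring(
    '<div class="gs_r"><h3 class="gs_rt">'
    '<a href="https://example.org/paper">Example title</a></h3></div>'
)

# Without @ and quotes, [class=gs_rt] compares *child elements* named
# "class" and "gs_rt"; neither exists, so the predicate never matches.
print(doc.xpath("//h3[class=gs_rt]/a/@href"))     # []

# With @class and a quoted string literal, the attribute test works.
print(doc.xpath("//h3[@class='gs_rt']/a/@href"))  # ['https://example.org/paper']
```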
An alternative solution using bs4:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# Container where all the articles are located
for article_info in soup.select('#gsc_a_b .gsc_a_t'):
    # title CSS selector
    title = article_info.select_one('.gsc_a_at').text
    # Same title CSS selector, except we're grabbing the "data-href" attribute.
    # Note: it's a relative link, so you need to join it with the absolute base URL after extracting.
    title_link = article_info.select_one('.gsc_a_at')['data-href']
    print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\n')
# Part of the output:
'''
Title: Automating Gödel's Ontological Proof of God's Existence with Higher-order Automated Theorem Provers.
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&citation_for_view=m8dFEawAAAAJ:-f6ydRqryjwC
'''
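Since data-href is relative, the standard-library way to build the absolute link is urllib.parse.urljoin; the path below is just an illustrative value modeled on the output above:

```python
from urllib.parse import urljoin

base = 'https://scholar.google.com'
# An example relative data-href value, modeled on the output above.
relative = '/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ'

print(urljoin(base, relative))
# https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ
```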
Alternatively, you can do the same thing with the Google Scholar Author Articles API from SerpApi. The main difference is that you don't have to think about finding good proxies or solving CAPTCHAs, which you would even if you used selenium. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "9PepYk8AAAAJ",
}

search = GoogleSearch(params)
results = search.get_dict()

for article in results['articles']:
    article_title = article['title']
    article_link = article['link']
    print(f'Title: {article_title}\nLink: {article_link}\n')
# Part of the output:
'''
Title: p-GaN gate HEMTs with tungsten gate metal for high threshold voltage and low gate current
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:bUkhZ_yRbTwC
'''
Disclaimer: I work for SerpApi.