Extracting .text returns an empty string while scraping Google Scholar with Selenium

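I want to scrape the information on this page (https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en#). The page has a bar chart showing the number of citations per year. I'd like to get both the years and the citation counts into a list or table, but so far I can only get the years, not the citation counts. Do you have any suggestions for scraping and parsing this data? Thanks in advance, Ivan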

from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
Table=driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""") .click()
Years=driver.find_elements_by_xpath("//span[@class='gsc_g_t']")
#for Year in Years:
#   print(Year.text)
Citations=driver.find_elements_by_xpath("//a[@class='gsc_g_a']")
#for Citation in Citations:
#    print(Citation)
page_items=len(Years)
for i in range(page_items):
    print(Years[i].text, " : ", Citations[i].text)
driver.close()

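The element isn't displayed on the page, so .text won't return its text: Selenium's .text only gives you text that is visible on the rendered page.

You can use .get_attribute("textContent") instead, which works: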

import os
from selenium.webdriver import Chrome

# Path to chromedriver, read from an environment variable
CHROME_DRIVER_PATH = os.getenv("CHROME_DRIVER_PATH")

url = "https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en"

# Selenium 3 style API (executable_path, find_elements_by_xpath)
driver = Chrome(executable_path=CHROME_DRIVER_PATH)
driver.maximize_window()
driver.get(url)

# textContent also returns the text of elements that are not displayed, unlike .text
years = [element.get_attribute("textContent") for element in driver.find_elements_by_xpath('//span[@class="gsc_g_t"]')]
citations = [element.get_attribute("textContent") for element in driver.find_elements_by_xpath('//span[@class="gsc_g_al"]')]

for year, citation in zip(years, citations):
    print(year, citation)

driver.quit()
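
Side note for newer Selenium versions: the find_elements_by_xpath helpers were removed in Selenium 4.3, so the list comprehensions above raise an AttributeError there. Below is a minimal sketch of the same idea using the generic find_elements call. The helper name text_contents is made up for illustration, and the locator strategy is passed as the plain string "xpath" (the value of By.XPATH) so the snippet needs no imports:

```python
def text_contents(driver, xpath):
    """Return the stripped textContent of every element matching xpath.

    Unlike .text, textContent also covers elements that are not displayed.
    """
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH
    elements = driver.find_elements("xpath", xpath)
    # get_attribute may return None, so fall back to an empty string
    return [(element.get_attribute("textContent") or "").strip() for element in elements]
```

With an open driver, text_contents(driver, '//span[@class="gsc_g_t"]') would give the year labels, and the same call with the gsc_g_al XPath would give the citation counts.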

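This is an additional solution on top of Raphael Meudec's answer, solving the same problem with beautifulsoup. It uses almost the same code and logic as Raphael's.

If headers alone don't help, you'll need to combine them with proxies; otherwise Google Scholar will block the requests because they come from an automated script.

To add proxies to the request, assuming you're using the requests library, for example: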

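proxies = {
  'http': os.getenv('HTTP_PROXY'),
  # https:// URLs use the 'https' entry (HTTPS_PROXY is an assumed variable name)
  'https': os.getenv('HTTPS_PROXY'),
}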

# Request will be like so:
requests.get('google scholar link', proxies=proxies)

Code (the full example is in the online IDE, under the bs4 folder -> bs4_author_citedby_results):

from bs4 import BeautifulSoup
import requests, lxml, os, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

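proxies = {
  'http': os.getenv('HTTP_PROXY'),
  # https:// URLs use the 'https' entry (HTTPS_PROXY is an assumed variable name)
  'https': os.getenv('HTTPS_PROXY'),
}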


html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

# This is basically the same as Raphael Meudec suggested but using beautifulsoup
years = [graph_year.text for graph_year in soup.select('.gsc_g_t')]
citations = [graph_citation.text for graph_citation in soup.select('.gsc_g_a')]

for year, citation in zip(years,citations):
  print(f'{year} {citation}\n')


# Part of the output:
'''
2007 24

2008 30

2009 46
'''

Or you can produce JSON output by adding data.append() inside the zip() loop:

data = []

for year, citation in zip(years,citations):

  data.append({
    'year': year,
    'citation': citation,
  })

print(json.dumps(data, indent=2))

# Part of the output:
'''
[
  {
    "year": "2007",
    "citation": "24"
  },
  {
    "year": "2008",
    "citation": "30"
  }
]
'''
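
Since the question asks for a list or table, the same pairs can also be written out as CSV. This is only a sketch using the standard csv module. The function name citations_to_csv is made up for illustration, and the sample values are copied from the output above. It also checks the list lengths first, because zip() silently drops the unmatched tail when one list is shorter:

```python
import csv
import io


def citations_to_csv(years, citations):
    """Pair year and citation strings and return them as CSV text."""
    if len(years) != len(citations):
        # zip() would silently truncate to the shorter list, so fail loudly instead
        raise ValueError(f"{len(years)} years but {len(citations)} citation counts")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "citations"])
    for year, count in zip(years, citations):
        # int() turns the scraped strings into numbers and tolerates stray whitespace
        writer.writerow([int(year), int(count)])
    return buffer.getvalue()


# Sample values taken from the output shown above
print(citations_to_csv(["2007", "2008", "2009"], ["24", "30", "46"]))
```

To get a file instead of a string, the same writer can be pointed at open("citations.csv", "w", newline="") (the file name is just an example).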

Alternatively, you can use the Google Scholar Author Cited By API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "8Cuk5vYAAAAJ",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for graph_results in results['cited_by']['graph']:
  year = graph_results['year']
  citations = graph_results['citations']
  print(f'{year} {citations}\n')

# part of the output
# JSON output could be added in the same manner as with bs4 code.
'''
2007 24

2008 30

2009 46
'''
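
If you want the numbers themselves rather than printed lines, the graph entries can be collected into a dict. This is a small sketch, assuming the response has the cited_by -> graph shape used in the loop above. The name graph_to_dict and the sample dict are made up for illustration (the values are copied from the output above), and .get() is used so that a response without a graph yields an empty dict instead of a KeyError:

```python
def graph_to_dict(results):
    """Return {year: citations} from a response shaped like the one above."""
    # Fall back to empty containers when 'cited_by' or 'graph' is missing
    graph = results.get("cited_by", {}).get("graph", [])
    return {int(point["year"]): int(point["citations"]) for point in graph}


# Hypothetical, trimmed-down response used only to exercise the function
sample = {
    "cited_by": {
        "graph": [
            {"year": 2007, "citations": 24},
            {"year": 2008, "citations": 30},
        ]
    }
}

print(graph_to_dict(sample))  # {2007: 24, 2008: 30}
print(graph_to_dict({}))      # {}
```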

Disclaimer, I work for SerpApi.