Extracting .text returns an empty string while scraping Google Scholar with Selenium
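I want to scrape the information on this page (https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en#). The page has a bar chart showing the number of citations per year. I want to scrape both the years and the citation counts into a list or table, but so far I can only get the years, not the citation counts. Do you have any suggestions for scraping and parsing the data?
Thanks in advance,
Ivan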
from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""").click()
Years = driver.find_elements_by_xpath("//span[@class='gsc_g_t']")
#for Year in Years:
#    print(Year.text)
Citations = driver.find_elements_by_xpath("//a[@class='gsc_g_a']")
#for Citation in Citations:
#    print(Citation)
page_items = len(Years)
for i in range(page_items):
    # Lists are indexed with [i], not called with (i).
    # Citations[i].text is what comes back as an empty string.
    print(Years[i].text, " : ", Citations[i].text)
driver.close()
The element is not displayed on the page, so .text does not extract it.
You can use .get_attribute("textContent") instead, which works:
import os
from selenium.webdriver import Chrome
CHROME_DRIVER_PATH = os.getenv("CHROME_DRIVER_PATH")
url = "https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en"
driver = Chrome(executable_path=CHROME_DRIVER_PATH)
driver.maximize_window()
driver.get(url)
# Note: find_elements_by_xpath is the Selenium 3 API; Selenium 4 uses driver.find_elements(By.XPATH, ...)
# textContent returns the node's text even when the element is not displayed
years = [element.get_attribute("textContent") for element in driver.find_elements_by_xpath('//span[@class="gsc_g_t"]')]
citations = [element.get_attribute("textContent") for element in driver.find_elements_by_xpath('//span[@class="gsc_g_al"]')]
for year, citation in zip(years, citations):
    print(year, citation)
driver.quit()
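To illustrate the difference between .text and textContent, here is a minimal sketch of a helper that prefers the visible text and falls back to textContent when .text is empty. The name element_text is hypothetical and not part of Selenium; the helper only assumes the object exposes a .text property and a .get_attribute() method, as Selenium's WebElement does.

```python
def element_text(element):
    """Return the element's visible text, or its raw textContent if nothing is visible.

    element_text is a hypothetical helper, not a Selenium API. It relies only on
    the .text property and the .get_attribute() method of the object passed in.
    """
    visible = element.text
    if visible:
        return visible
    # Elements that are not displayed yield "" from .text;
    # textContent is read from the DOM regardless of visibility.
    raw = element.get_attribute("textContent")
    return raw.strip() if raw else ""
```

With a live driver it could be applied as [element_text(e) for e in driver.find_elements_by_xpath('//span[@class="gsc_g_al"]')], so the same list comprehension works whether or not the bars' labels happen to be visible.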
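This is an additional solution to Raphael Meudec's answer, but using beautifulsoup. It contains almost the same code and logic that Raphael used.
If headers alone don't help, you need to combine them with a proxy; otherwise Google Scholar will block the requests because they are sent by an automated script.
To add a proxy to the request, assuming you are using the requests library, for example: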
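proxies = {
    'http': os.getenv('HTTP_PROXY'),
    # requests picks the proxy by URL scheme, so https:// URLs need an 'https' entry
    # (HTTPS_PROXY is an assumed environment variable name)
    'https': os.getenv('HTTPS_PROXY'),
}
# Request will be like so:
requests.get('google scholar link', proxies=proxies)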
Code (full example in the online IDE, under the bs4 folder -> bs4_author_citedby_results):
from bs4 import BeautifulSoup
import requests, lxml, os, json
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
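proxies = {
    'http': os.getenv('HTTP_PROXY'),
    # The URL below is https://, so an 'https' entry is needed (HTTPS_PROXY is an assumed variable name)
    'https': os.getenv('HTTPS_PROXY'),
}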
html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')
# This is basically the same as Raphael Meudec suggested but using beautifulsoup
years = [graph_year.text for graph_year in soup.select('.gsc_g_t')]
citations = [graph_citation.text for graph_citation in soup.select('.gsc_g_a')]
for year, citation in zip(years, citations):
    print(f'{year} {citation}\n')
# Part of the output:
'''
2007 24
2008 30
2009 46
'''
Alternatively, you can produce JSON output by adding data.append() inside the zip() loop:
data = []
for year, citation in zip(years, citations):
    data.append({
        'year': year,
        'citation': citation,
    })
print(json.dumps(data, indent=2))
# Part of the output:
'''
[
  {
    "year": "2007",
    "citation": "24"
  },
  {
    "year": "2008",
    "citation": "30"
  }
]
'''
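The scraped values are strings, as the quoted numbers in the JSON output show. If you need numbers for sorting or plotting, a minimal sketch like the following converts them. The name to_records is hypothetical, and the sketch assumes every scraped value is a plain integer string; the sample values are the ones from the partial output above.

```python
import json


def to_records(years, citations):
    """Pair year and citation strings and convert both to integers."""
    return [
        {"year": int(year), "citations": int(count)}
        for year, count in zip(years, citations)
    ]


# Sample values taken from the partial output shown above.
records = to_records(["2007", "2008", "2009"], ["24", "30", "46"])
print(json.dumps(records, indent=2))
```

int() raises ValueError on anything that is not a whole number, which makes an unexpected page layout fail loudly instead of silently producing wrong data.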
Alternatively, you can use the Google Scholar Author Cited By API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "8Cuk5vYAAAAJ",
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
for graph_results in results['cited_by']['graph']:
    year = graph_results['year']
    citations = graph_results['citations']
    print(f'{year} {citations}\n')
# part of the output
# JSON output could be added in the same manner as with bs4 code.
'''
2007 24
2008 30
2009 46
'''
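Indexing results['cited_by']['graph'] directly raises KeyError if either key is missing from the response. As a hedged sketch, the hypothetical helper graph_to_rows below reads the same keys with .get() and returns an empty list instead. The sample dict is hand-written in the shape the loop above expects; it is not real API output.

```python
def graph_to_rows(results):
    """Return (year, citations) tuples from a response-shaped dict, or [] if the graph is missing."""
    graph = results.get("cited_by", {}).get("graph", [])
    return [(item["year"], item["citations"]) for item in graph]


# Hand-written sample in the same shape as results['cited_by']['graph'] (not real API output).
sample = {
    "cited_by": {
        "graph": [
            {"year": 2007, "citations": 24},
            {"year": 2008, "citations": 30},
        ]
    }
}

for year, citations in graph_to_rows(sample):
    print(year, citations)
```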
Disclaimer, I work for SerpApi.