Parse Google Scholar search results scraped with rvest

Question

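I am trying to use rvest to scrape one page of Google Scholar search results into a data frame of authors, paper title, year, and journal title.

The simplified, reproducible example below searches Google Scholar for the example terms "apex predator conservation".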

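Note: to stay within the terms of service, I only want to process the first page of search results, the same page a manual search would return. I am not asking about automation to scrape additional pages.

The code below already extracts:

authors
paper title
year

but it does not extract:

journal title

I would like to extract the journal title and add it to the output.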

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
# Backslashes must be doubled inside R string literals
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)

df

The output of this code looks like this:

#>                                                                                                                                                   titles
#> 1                                                                                    [HTML][HTML] Saving large carnivores, but losing the apex predator?
#> 2                               Site fidelity and sex-specific migration in a mobile apex predator: implications for conservation and ecosystem dynamics
#> 3                  Effects of tourism-related provisioning on the trophic signatures and movement patterns of an apex predator, the Caribbean reef shark

#>                                           authors years
#> 1                  A Ordiz, R Bischof, JE Swenson  2013
#> 2  A Barnett, KG Abrantes, JD Stevens, JM Semmens  2011

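Two questions:

How can I add a column with the journal title extracted from the raw data?
Is there a reference where I can read and learn more about extracting the other fields myself, so that I don't have to ask here?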

Answer 1

One way to add them would be:

library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(purrr)  # provides map() and map_chr(), used below
library(jsonlite)

url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
# Backslashes must be doubled inside R string literals
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)


leftovers <- authors_years %>% 
  str_remove_all(authors) %>% 
  str_remove_all(years)


journals <- str_split(leftovers, "-") %>% 
            map_chr(2) %>% 
            str_extract_all("[:alpha:]*") %>% 
            map(function(x) x[x != ""]) %>% 
            map(~paste(., collapse = " ")) %>% 
            unlist()

# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, journals = journals, stringsAsFactors = FALSE)
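
As an alternative, here is a minimal base R sketch that can be tried without fetching any page. It is not the approach used above, and it assumes each `.gs_a` string follows the layout "authors - journal, year - publisher". The input strings are typed by hand for illustration (the journal and publisher names are placeholders), so real results may need extra handling.

```r
# Hand-typed strings imitating the text of '.gs_a' nodes (not scraped output).
# Assumed layout: "authors - journal, year - publisher".
authors_years <- c(
  "A Ordiz, R Bischof, JE Swenson - Example Journal A, 2013 - example.com",
  "A Barnett, KG Abrantes - Example Journal B, 2011 - example.org"
)

# Split each string into fields on the " - " separators.
# \W+ on both sides mirrors the separator pattern used for the authors above.
parts <- strsplit(authors_years, "\\W+-\\W+", perl = TRUE)

# The second field is assumed to hold "journal, year"; NA if it is missing.
venue <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)

# Drop the trailing ", year" to leave only the journal title.
journals <- trimws(sub(",?\\s*\\d{4}$", "", venue, perl = TRUE))

journals
```

Splitting on the full " - " separator rather than on a bare "-" keeps hyphenated journal names in one piece, which is the main reason for this variant.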

Regarding your second question: the SelectorGadget Chrome extension is handy for finding the CSS selectors of the elements you want. In your case, though, the authors, journal, and year all sit in the same element under one CSS class (.gs_a), so the only way to disentangle them is with regex. So I guess: learn a bit about CSS selectors and regex :)
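
To make the point about selectors concrete, here is a small sketch that runs on a hand-written HTML snippet instead of a live page. Only the class names `gs_rt` and `gs_a` come from the question; the surrounding markup and the text content are assumptions made up for illustration.

```r
library(rvest)
library(xml2)

# Hand-written HTML imitating a single search result (not real page source).
html <- '<div>
  <h3 class="gs_rt"><a href="#">An example title</a></h3>
  <div class="gs_a">A Author, B Author - Example Journal, 2013 - example.com</div>
</div>'

page <- xml2::read_html(html)

# The CSS selector isolates the title on its own...
title <- rvest::html_text(rvest::html_nodes(page, ".gs_rt"))

# ...but authors, journal, and year come back together as one string,
# so a selector alone cannot separate them.
meta <- rvest::html_text(rvest::html_nodes(page, ".gs_a"))

title
meta
```

Because `meta` is a single string, any further splitting into authors, journal, and year has to be done with string processing.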
