Scraping/accessing all search results from input field

I would like to scrape https://www.deutsche-biographie.de/ with rvest. At the top of this web page there is an input field into which a name has to be entered; the corresponding search results then show all persons with this or a similar name.

For example, I entered the name 'Meier' and scraped the corresponding search results with the following code:

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>% 
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?-\\s(\\d{4})")

places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")

result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

result <- result %>% 
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")

result
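
For reference, the entry layout that the regexes above imply for the data-orte attribute is "place@coordinates@type", with entries separated by ";" and the type suffix one of geburt, tod or wirk; this is inferred from the code, not documented by the site. A minimal worked example with an invented attribute value:

places_attr <- "Hamburg@53.55 N, 10.0 O@geburt;Berlin@52.52 N, 13.41 O@tod" # invented sample value
entries <- strsplit(places_attr, ";")[[1]]
entries[grepl("@geburt$", entries)]
#> [1] "Hamburg@53.55 N, 10.0 O@geburt"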

The URL used here is "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier", where name=meier is the name I entered manually. Is there a way to access all names/search results without having to specify just one particular name? Thanks a lot for any hints!

UPDATE (solution): As suggested by @QHarr, I inserted a for loop that iterates over all pages:

    for (page_result in seq( from = 1, to = 2369 )) {
      link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                    page_result)
...}

So the whole code looks like this:

result_total = data.frame()

for (page_result in seq( from = 1, to = 2369 )) {
  link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
                page_result)
  
  download.file(link, destfile = "scrapedpage.html", quiet = TRUE)
  page = read_html("scrapedpage.html") # read the local copy to prevent 'Error in open.connection(x, "rb"): Timeout was reached'
  name = page %>% html_nodes(".media-heading a") %>% html_text()
  information = page %>% html_nodes("div.media-body p") %>% html_text()
  result = data.frame(name, information)
  result$information <- result$information %>% trimws() %>% strsplit(split = ", \n") %>% lapply(trimws)
  result <- tidyr::unnest_wider(result, information) %>%
    rename(years = 2, profession = 3) %>% 
    tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?-\\s(\\d{4})")
  
  places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
  
  result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)])
  result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)])
  result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
  
  result <- result %>% 
    tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
    tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
  
  print(paste("Page:", page_result)) #track the page that R is currently looping over
  result_total <- rbind(result_total, result)
}


result_total <- apply(result_total, 2, as.character) # coerce every column (including list columns) to character; apply() returns a matrix
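
To save the combined results, a minimal export sketch (converting the matrix back to a data frame first; the file name is arbitrary):

result_total <- as.data.frame(result_total)
write.csv(result_total, "deutsche_biographie_results.csv", row.names = FALSE)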

You can get all of the names by using the "*" wildcard, but you still have to retrieve the results page by page:

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*

You can get the total result count from the initial request; then, given that results come in batches of 10 and that the pagination is reflected in the URL, issue requests for all the pages needed to cover that total, 10 results at a time (a sketch for deriving the page count follows the examples below). Individual pages look like:

Page 1:

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=0

....

Page 11:

https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=10
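
A minimal sketch for deriving the number of pages from the hit total shown on the first results page; the "Treffer" text pattern and the omitted _csrf token are assumptions about the live site, not something confirmed above:

library(rvest)
library(stringr)

first_page <- read_html("https://www.deutsche-biographie.de/search?name=*&number=0")

# Assumption: the hit total appears somewhere in the page text as a number
# followed by "Treffer"; adjust the pattern if the live markup differs.
total_hits <- first_page %>%
  html_text() %>%
  str_extract("[0-9.,]+\\s*Treffer") %>%
  str_remove_all("[^0-9]") %>%
  as.numeric()

pages_needed <- ceiling(total_hits / 10) # results come in batches of 10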


Issue the requests in parallel and gather the results. Given the total number of requests required, build in a polite wait time between them.
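
A sketch of that approach, reusing the per-page extraction from the loop above; fetch_page() is a hypothetical helper, and the _csrf token is omitted here and may need to come from a live session:

library(rvest)
library(purrr)

fetch_page <- function(n) {
  Sys.sleep(1) # polite wait between requests
  page <- read_html(paste0("https://www.deutsche-biographie.de/search?name=*&number=", n))
  data.frame(
    name        = page %>% html_nodes(".media-heading a") %>% html_text(),
    information = page %>% html_nodes("div.media-body p") %>% html_text()
  )
}

# Sequential but polite; to parallelize, swap map_dfr() for
# furrr::future_map_dfr() after future::plan(future::multisession).
result_total <- map_dfr(0:10, fetch_page) # first 11 pages as a demo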