Scraping/accessing all search results from input field
I want to scrape https://www.deutsche-biographie.de/ with rvest. At the top of the page there is an input field into which a name has to be entered; the search results then list all persons with this or a similar name.
For example, I entered the name 'Meier' and scraped the corresponding search results with the following code:
library(rvest)
library(dplyr)
page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
rename(years = 2, profession = 3) %>%
tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?-\\s(\\d{4})")
places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
result <- result %>%
tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>%
tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
result
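To see what the place extraction does, here is a minimal, self-contained sketch. The sample `data-orte` value is invented for illustration; its `place@coordinates@type` layout is inferred from the regexes above:

```r
# Invented sample in the same "place@coords@type" layout the regexes above assume
orte <- "Berlin@52.5,13.4@geburt;Hamburg@53.6,10.0@tod"

x <- strsplit(orte, ";")[[1]]            # one entry per place
birth <- x[grepl("@geburt$", x)]         # keep the birth-place entry

sub("^(.*?)@(.*?)@.*$", "\\1", birth)    # place name:  "Berlin"
sub("^(.*?)@(.*?)@.*$", "\\2", birth)    # coordinates: "52.5,13.4"
```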
The URL used here is "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier", where name=meier is the name I typed in manually. Is there a way to access all names/search results without having to specify one particular name?
Many thanks for any hints!
Update, solution:
As suggested by @QHarr, I inserted a for loop that iterates over all pages:
for (page_result in seq( from = 1, to = 2369 )) {
link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
page_result)
...}
So the whole code looks like this:
result_total = data.frame()
for (page_result in seq(from = 0, to = 2368)) { # 'number' is 0-based: page 1 is number=0
link = paste0("https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=",
page_result)
download.file(link, destfile = "scrapedpage.html", quiet = TRUE)
page = read_html("scrapedpage.html") # read the local copy to avoid 'Error in open.connection(x, "rb") : Timeout was reached'
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
rename(years = 2, profession = 3) %>%
tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?-\\s(\\d{4})")
places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)])
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)])
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
result <- result %>%
tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>%
tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
print(paste("Page:", page_result)) #track the page that R is currently looping over
result_total <- rbind(result_total, result)
}
result_total <- apply(result_total, 2, as.character) # flatten list-columns to character, e.g. before write.csv
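A single timed-out page aborts the whole loop above. One hedged way to make it more robust is a small retry wrapper around the request; the helper name, try count, and wait time below are my own illustrative choices, not from the original post:

```r
# Hypothetical helper: call f(), retrying a few times before giving up
with_retries <- function(f, n_tries = 3, wait = 1) {
  for (i in seq_len(n_tries)) {
    out <- tryCatch(f(), error = function(e) NULL)
    if (!is.null(out)) return(out)   # success: hand the result back
    Sys.sleep(wait)                  # back off briefly before retrying
  }
  stop("all attempts failed")
}

# Inside the loop you would then use:
# page <- with_retries(function() read_html(link))
```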
To get all of them, use the '*' wildcard as the name. You still need to retrieve the results page by page, though:
https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*
You can gather the total results count from the initial request; then, given that results come in batches of 10 and that pagination is reflected in the url, make requests for all the pages required, i.e. total results / batches of 10. An individual page looks like:
Page 1:
https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=0
....
Page 11:
https://www.deutsche-biographie.de/search?_csrf=8dc48621-c226-47b9-98b9-a9dfb2ab0ad8&name=*&number=10
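The paging scheme above can be turned into URLs mechanically. A sketch, where `total_hits` is a made-up example value (read the real count off the first results page), and the `_csrf` parameter is omitted for brevity:

```r
total_hits <- 23690                  # made-up example; take this from the first page
per_page   <- 10                     # results come in batches of 10
n_pages    <- ceiling(total_hits / per_page)

# 'number' is 0-based: page 1 is number=0, page 11 is number=10
numbers <- seq_len(n_pages) - 1
base    <- "https://www.deutsche-biographie.de/search?name=*&number="
urls    <- paste0(base, numbers)

urls[c(1, 11)]  # the first and eleventh page URLs
```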
Issue the requests in parallel and gather the results. Consider a polite wait time given the total number of requests.
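A hedged sketch of that advice: collect each page into a pre-allocated list with a pause between requests, then bind once at the end. The `scrape_one_page` stand-in and the delay value are illustrative, not from the original answer:

```r
# Illustrative stand-in for the real per-page rvest scraping code
scrape_one_page <- function(i) data.frame(name = paste0("person_", i), page = i)

n_pages <- 3                        # tiny demo; the real run had ~2369 pages
pages   <- vector("list", n_pages)  # pre-allocate instead of rbind-ing in the loop
for (i in seq_len(n_pages)) {
  pages[[i]] <- scrape_one_page(i)
  Sys.sleep(0.1)                    # polite wait between requests
}
result_total <- do.call(rbind, pages)
```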