R: Scrape multiple URLs using pipe-chained commands in rvest
I have a character vector containing multiple URLs, and I want to download the content of each one. To avoid writing hundreds of commands, I would like to automate the process with an lapply loop, but my command returns an error. Is it possible to scrape multiple URLs this way?
Current approach

Long method: works, but I want to automate it:
urls <- c("https://en.wikipedia.org/wiki/Belarus", "https://en.wikipedia.org/wiki/Russia", "https://en.wikipedia.org/wiki/England")
library(rvest)
library(httr) # required for user_agent command
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
session2 <- jump_to(session, "https://en.wikipedia.org/wiki/Belarus")
session3 <- jump_to(session, "https://en.wikipedia.org/wiki/Russia")
writeBin(session2$response$content, "test1.txt")
writeBin(session3$response$content, "test2.txt")
Automated/loop: does not work.
urls <- c("https://en.wikipedia.org/wiki/Belarus", "https://en.wikipedia.org/wiki/Russia", "https://en.wikipedia.org/wiki/England")
library(rvest)
library(httr) # required for user_agent command
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
lapply(urls, .%>% jump_to(session))
Error: is.session(x) is not TRUE
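The error occurs because the functional sequence `. %>% jump_to(session)` is equivalent to `function(.) jump_to(., session)`, so each URL string is passed as the first argument, where `jump_to()` expects a session object; hence `is.session(x) is not TRUE`. A minimal sketch of a fix, reusing the `session` and `urls` objects defined above, is to wrap the call in an anonymous function so the argument order stays correct:

```r
# jump_to() expects the session first and the URL second,
# so pass the URL explicitly as the second argument:
sessions <- lapply(urls, function(u) jump_to(session, u))
```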
Summary

I would like to automate the two steps, jump_to() and writeBin(), shown in the code below:
session2 <- jump_to(session, "https://en.wikipedia.org/wiki/Belarus")
session3 <- jump_to(session, "https://en.wikipedia.org/wiki/Russia")
writeBin(session2$response$content, "test1.txt")
writeBin(session3$response$content, "test2.txt")
You can do it like this:
urls <- c("https://en.wikipedia.org/wiki/Belarus", "https://en.wikipedia.org/wiki/Russia", "https://en.wikipedia.org/wiki/England")
require(httr)
require(rvest)
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
outfile <- sprintf("%s.html", sub(".*/", "", urls))
# Navigate the session to a URL and write the raw response body to a file
jump_and_write <- function(x, url, out_file){
  tmp <- jump_to(x, url)
  writeBin(tmp$response$content, out_file)
}
for(i in seq_along(urls)){
  jump_and_write(session, urls[i], outfile[i])
}
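If you prefer the apply-family style from the question, the same loop can be written with `Map()`, which pairs each URL with its output file. This is a sketch reusing the `session`, `urls`, and `outfile` objects defined above:

```r
# Map() iterates over urls and outfile in parallel;
# invisible() suppresses the returned list of writeBin() results
invisible(Map(function(u, f) {
  tmp <- jump_to(session, u)
  writeBin(tmp$response$content, f)
}, urls, outfile))
```

Note that rvest 1.0 renamed `html_session()` to `session()` and `jump_to()` to `session_jump_to()`; on newer versions the old names may be deprecated or unavailable.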