Running RSelenium in parallel using Docker
I am currently trying to parallelize my RSelenium web scraper (running on Docker) with the doParallel package. I found this post (Speed up web scraping using multiple Rselenium browsers) and copied the answer provided by @hdharrison here:
library(RSelenium)
library(rvest)
library(magrittr)
library(foreach)
library(doParallel)
# using docker run -d -p 4445:4444 selenium/standalone-chrome:3.5.3
# in windows
URLsPar <- c("https://stackoverflow.com/", "https://github.com/",
             "http://www.bbc.com/", "http://www.google.com",
             "https://www.r-project.org/", "https://cran.r-project.org",
             "https://twitter.com/", "https://www.facebook.com/")
appHTML <- c()
# create a cluster with one worker per core (minus one) and register it
(cl <- (detectCores() - 1) %>% makeCluster) %>% registerDoParallel
# open a remoteDriver for each node on the cluster
clusterEvalQ(cl, {
  library(RSelenium)
  remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L,
                        browserName = "chrome")
  remDr$open()
})
# each worker navigates with its own remDr and returns the page title
ws <- foreach(x = seq_along(URLsPar),
              .packages = c("rvest", "magrittr", "RSelenium")) %dopar% {
  print(URLsPar[x])
  remDr$navigate(URLsPar[x])
  remDr$getTitle()[[1]]
}
> ws
[[1]]
[1] "Stack Overflow - Where Developers Learn, Share, & Build Careers"
[[2]]
[1] "The world's leading software development platform · GitHub"
[[3]]
[1] "BBC - Homepage"
[[4]]
[1] "Google"
[[5]]
[1] "R: The R Project for Statistical Computing"
[[6]]
[1] "The Comprehensive R Archive Network"
[[7]]
[1] "Twitter. It's what's happening."
[[8]]
[1] "Facebook - Log In or Sign Up"
# close browser on each node
clusterEvalQ(cl, {
  remDr$close()
})
# the cluster was created explicitly with makeCluster(), so stop it explicitly
stopCluster(cl)
This looks like the solution I am after, but when I run it I get this error message:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
3 nodes produced errors; first error: Undefined error in httr call. httr output: Failed to connect to 192.168.99.100 port 4445: Connection refused
This is the 'docker ps' output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f2d62f6b293b selenium/standalone-chrome:3.5.3 "/opt/bin/entry_poin…" 36 minutes ago Up 35 minutes 0.0.0.0:4445->4444/tcp recursing_austin
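I know I have to open a new browser for each core, and I suspect that is where the problem lies: when I reduce the number of cores, fewer errors are produced.
Please let me know if I can provide more details. Thanks in advance!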
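A "Connection refused" like the one above can be narrowed down by testing, from plain R and before any cluster is started, which address the published port actually answers on. The sketch below is an illustration and not part of the original post: `can_connect` is a hypothetical helper name, and the two candidate addresses are assumptions taken from the code and the 'docker ps' output.

```r
# Hypothetical helper: TRUE if a TCP connection to host:port can be opened,
# FALSE if it is refused or times out. Uses base R only.
can_connect <- function(host, port, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# Candidate addresses (assumptions): the one hard-coded in the question,
# and localhost, where 'docker ps' reports the port as published.
candidates <- c("192.168.99.100", "localhost")
reachable <- vapply(candidates, can_connect, logical(1), port = 4445L)
print(reachable)
```

If only one of the two addresses is reachable, that is the value to pass as `remoteServerAddr`. If neither is, the problem is more likely with the container or the port mapping than with the number of workers.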
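In the meantime I found a way to fix the error. I am posting it here in case someone else faces the same problem. I can't explain the logic behind it, but my code runs as expected when I replace
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L,
                      browserName = "chrome")
with
remDr <- remoteDriver(port = 4445L)
and use the Firefox browser instead of Chrome.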
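A possible explanation, offered as an assumption rather than something the post confirms: 192.168.99.100 is the default address of the Docker Toolbox virtual machine, whereas other Docker setups usually publish ports on localhost. When `remoteServerAddr` is omitted, `remoteDriver()` connects to "localhost" and defaults to `browserName = "firefox"`. The sketch below puts the fix into the whole workflow. The names `open_driver` and `scrape_titles` are hypothetical, and it assumes a Firefox Selenium container (for example the `selenium/standalone-firefox` image) is already running with its port published on 4445.

```r
# Runs on each worker: open one browser session and keep it in the worker's
# global environment so that later %dopar% tasks can reuse it.
open_driver <- function(port) {
  library(RSelenium)
  # remoteServerAddr defaults to "localhost", browserName to "firefox"
  drv <- remoteDriver(port = port)
  drv$open(silent = TRUE)
  assign("remDr", drv, envir = .GlobalEnv)
  invisible(TRUE)
}

# Hypothetical wrapper: fetch the page title of each URL in parallel.
scrape_titles <- function(urls, n_workers = 2L, port = 4445L) {
  `%dopar%` <- foreach::`%dopar%`

  cl <- parallel::makeCluster(n_workers)
  on.exit({
    # close the browsers first, then stop the explicitly created cluster
    try(parallel::clusterEvalQ(cl, remDr$close()), silent = TRUE)
    parallel::stopCluster(cl)
  }, add = TRUE)
  doParallel::registerDoParallel(cl)

  # one remoteDriver per worker
  parallel::clusterCall(cl, open_driver, port)

  foreach::foreach(u = urls, .packages = "RSelenium") %dopar% {
    remDr$navigate(u)
    remDr$getTitle()[[1]]
  }
}

# Usage (needs the running container, so it is left commented out):
# ws <- scrape_titles(c("https://www.r-project.org/",
#                       "https://cran.r-project.org"))
```

Capping `n_workers` at a small number instead of `detectCores() - 1` is a deliberate choice in this sketch: every worker opens its own browser session inside the single container, so the container's resources, not the host's core count, are the likely limit.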