How to properly close connections so I won't get "Error in file(con, "r") : all connections are in use" when using readLines() and tryCatch()
I have a list of URLs (more than 4,000) from a specific domain (pixilink.com), and what I want to do is figure out whether each provided URL is a picture or a video. To do this, I used the solution provided here: How to write trycatch in R, and wrote code like the following:
#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  mycontent <- readLines(x)
  mypos <- grep("initial_mode = ", mycontent)
  if(grepl("0", mycontent[mypos])){
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    return("video")
  } else{
    return(NA)
  }
}
In addition, to guard against errors from non-existent URLs, I used the following code:
readUrl <- function(url) {
  out <- tryCatch(
    {
      readLines(con=url, warn=FALSE)
      return(1)
    },
    error=function(cond) {
      return(NA)
    },
    warning=function(cond) {
      return(NA)
    },
    finally={
      message(url)
    }
  )
  return(out)
}
Finally, I split up the list of URLs and passed them to the functions above (for example, here I used 1,000 values from the URL list):
a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] #only chose first 1000 rows
tt <- numeric(length(vec)) #checking validity of url
for (i in 1:length(vec)){
  tt[i] <- readUrl(vec[i])
  print(i)
}
g <- data.frame(vec, tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url
dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
  dd[j] <- urlmode(g2[j,1])
}
Final <- cbind(g2, dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))
I ran this code on a sample list of 100 URLs and it worked; however, after I ran it on the whole list of URLs, it returned an error. This is the error: Error in textConnection("rval", "w", local = TRUE) : all connections are in use
After that, even for the sample URLs (the 100 samples I had tested before), running the code gave me this error message: Error in file(con, "r") : all connections are in use
I also tried closeAllConnections() after every function call in the loop, but it didn't work.
Can anyone explain what this error is? Is it related to the number of requests we can make to the website? Is there any solution for it?
So, my guess as to why this is happening is that you are not closing the connections that get opened via tryCatch() and via urlmode() through the use of url(). I was unsure of how urlmode() was going to be used, so I had simplified it as much as possible (in hindsight, that was done poorly, my apologies). I therefore took the liberty of rewriting urlmode() to try to make it more robust for what appears to be a more expansive task at hand.
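Before getting to the rewrite, it may help to see the failure mode in isolation. The following is a minimal, offline sketch (using textConnection() instead of url(), purely for illustration) of how R's fixed pool of connection slots gets exhausted when connections are opened but never closed:

```r
# Minimal offline sketch of the "all connections are in use" failure mode.
# R has a fixed pool of connection slots (128 in total by default), and
# url()/file()/textConnection() each occupy one until close() is called,
# so a loop that opens connections without closing them eventually fails.

n_open <- function() nrow(showConnections())  # currently open user connections

before <- n_open()

# Opening a connection takes a slot
con <- textConnection("initial_mode = 'tour'")
stopifnot(n_open() == before + 1)

# Reading from it does NOT free the slot...
line <- readLines(con)
stopifnot(n_open() == before + 1)

# ...only close() does
close(con)
stopifnot(n_open() == before)
```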
I think the comments in the code should help, so please see below:
#Updated URL mode function with better
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
  #Check if URL is good to go
  if(!httr::http_error(x)){
    #Test cases
    #x <- "www.pixilink.com/3"
    #x <- "https://www.pixilink.com/93320"
    #x <- "https://www.pixilink.com/93313"

    #Then, since there are redirect shenanigans,
    #get the actual URL the input points to.
    #It should just be the input URL if there is no redirection.
    #This is important as it also takes care of checking whether
    #http or https needs to be prefixed in case the input URL
    #is supplied without those
    #(this can cause problems for url() below)
    myx <- httr::HEAD(x)$url

    #Then check what the default mode is
    mycon <- url(myx)
    open(mycon, "r")
    mycontent <- readLines(mycon)
    mypos <- grep("initial_mode = ", mycontent)

    #Close the connection since it's no longer necessary
    close(mycon)

    #Some URLs with weird formats can return empty on this one
    #since they don't follow the expected format.
    #See for example: "https://www.pixilink.com/clients/899/#3"
    #which is actually redirected from "https://www.pixilink.com/3"

    #After that, evaluate what's at mypos, and always
    #return the actual URL along with the result
    if(!purrr::is_empty(mypos)){
      #mystr <- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s=).*")
      mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
      return(c(myx, mystr))
      #return(mystr)

      #So once all that is done, check if the line at mypos
      #contains a 0 (picture) or tour (video)
      #if(grepl("0", mycontent[mypos])){
      #  return(c(myx, "picture"))
      #  #return("picture")
      #} else if(grepl("tour", mycontent[mypos])){
      #  return(c(myx, "video"))
      #  #return("video")
      #}
    } else{
      #Valid URL but not interpretable
      return(c(myx, "uninterpretable"))
      #return("uninterpretable")
    }
  } else{
    #Straight up invalid URL
    #No myx variable to return here, just x
    return(c(x, "invalid"))
    #return("invalid")
  }
}
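As an aside, a pattern worth knowing here (my addition, not part of the code above) is to register the close() with on.exit(), which guarantees the connection is released even if readLines() errors midway. A hypothetical helper, demonstrated on a local file since the pattern is identical for url() connections:

```r
# Hypothetical helper (my addition, not from the answer above): on.exit()
# guarantees the connection is closed even if readLines() errors midway,
# so failing pages cannot leak connection slots.
read_lines_safely <- function(description) {
  con <- file(description, "r")
  on.exit(close(con), add = TRUE)  # runs on normal return AND on error
  readLines(con, warn = FALSE)
}

# Demo on a temporary file
tmp <- tempfile()
writeLines("initial_mode = 'tour'", tmp)
stopifnot(identical(read_lines_safely(tmp), "initial_mode = 'tour'"))

# Even after a failed call, no connection is left open
n_before <- nrow(showConnections())
try(read_lines_safely(file.path(tempdir(), "no-such-file")), silent = TRUE)
stopifnot(nrow(showConnections()) == n_before)
```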
#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)
#All future + progressr related stuff
#learned courtesy
#
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)
#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))
#Website's base URL
baseurl <- "https://www.pixilink.com"
#Using future_lapply() to recursively apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar
#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
  myprog <- progressr::progressor(along = range)
  sitetype <- do.call(rbind, future_lapply(range, function(b, x){
    myprog() #Progress bar signaller
    myurl <- paste0(b, "/", x)
    cat("\n", myurl, " ")
    myret <- urlmode(myurl)
    cat(myret, "\n")
    return(c(myurl, myret))
  }, b = baseurl, future.chunk.size = 10))
})
#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")
#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")
head(sitetype)
# given_url actual_url mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310 invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311 invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313 picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315 tour
unique(sitetype$mode)
# [1] "invalid" "floorplan2d" "picture" "tour"
#--------
Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity and URL redirection, and "intelligently" extracts the value assigned to initial_mode. With the help of future_lapply() from the future.apply package and the progress bar from the progressr package, this can now be applied very conveniently in parallel to the desired pixilink.com/<integer> URLs. After a bit of wrangling, the results can be presented very neatly as a data.frame, as shown.
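One detail in the future_lapply() call that may not be obvious: since b is supplied by name (b = baseurl), each element of range gets matched to the remaining parameter, x. Base lapply() follows the same positional/named matching rules, so this can be checked offline:

```r
# Sketch of the argument matching in the future_lapply() call above:
# b is supplied by name, so each element of the input vector is matched
# to the remaining parameter, x. Base lapply() matches the same way.
baseurl <- "https://www.pixilink.com"
urls <- lapply(93310:93312, function(b, x) paste0(b, "/", x), b = baseurl)
stopifnot(identical(urls[[1]], "https://www.pixilink.com/93310"))
```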
As an example, I have demonstrated this for a small range in the code above. Note the commented-out 1:10000 range in this context: I have had this code running over the last couple of hours for this (hopefully sufficiently large) range of URLs to check for errors and problems. I can attest that I came across no errors (only the regular warning In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I didn't want to upload it to pastebin or elsewhere unnecessarily). Due to an oversight on my part, I forgot to benchmark that run, but I suppose I could do that at a later point in time if performance metrics are desired or would be considered interesting.
For your purposes, I believe you can simply take this entire code snippet and run it verbatim (or with modifications), only changing the range assignment prior to the with_progress() step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and there's no tryCatch() messiness to deal with).