how to properly close connection so I won't get "Error in file(con, "r") : all connections are in use" when using "readlines" and "tryCatch"

I have a list of URLs (more than 4000) from a specific domain (pixilink.com), and what I want to do is figure out whether each provided URL is a picture or a video. To do this, I used the solution provided here: How to write trycatch in R, and wrote the code shown below:

#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  mycontent <- readLines(x)
  mypos <- grep("initial_mode = ", mycontent)
  
  if(grepl("0", mycontent[mypos])){
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    return("video")
  } else{
    return(NA)
  }
}

Additionally, to prevent errors for URLs that don't exist, I used the following code:

readUrl <- function(url) {
  out <- tryCatch(
    {
      readLines(con=url, warn=FALSE)
      return(1)    
    },
    error=function(cond) {
      return(NA)
    },
    warning=function(cond) {    
      return(NA)
    },
    finally={
      message( url)
    }
  )    
  return(out)
}

Finally, I split up the list of URLs and passed it to the functions above (for example, here I used 1000 values from the URL list):

a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows

tt <- numeric(length(vec)) # checking validity of url
for (i in 1:length(vec)){
  tt[i] <- readUrl(vec[i])
  print(i)
}    
g <- data.frame(vec,tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url

dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
  dd[j] <- urlmode(g2[j,1])      
}    
Final <- cbind(g2,dd)
Final <- dplyr::left_join(g, Final, by = c("vec" = "vec"))

I ran this code on a sample list of 100 URLs and it worked; however, after I ran it on the whole list of URLs, it returned an error. Here is the error: Error in textConnection("rval", "w", local = TRUE) : all connections are in use

After this, even for the sample URLs (the 100 samples I tested before), I ran the code and received this error message: Error in file(con, "r") : all connections are in use

I also tried closeAllConnections() after each function call in the loop, but it didn't work. Can anyone explain what this error is? Is it related to the number of requests we can make to the website? Is there a solution?

So, my guess as to why this is happening is that you are not closing the connections you open via readLines(), both in tryCatch() and in urlmode(). I was unsure of how urlmode() was going to be used, so I had simplified it as much as possible (in hindsight, that was done poorly, and I apologize). I have therefore taken the liberty of rewriting urlmode() to try and make it more robust for what appears to be a more expansive task at hand.
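To make the failure mode concrete, here is a minimal, self-contained sketch of the problem (the function names are made up for illustration, and textConnection() stands in for url()/file() so it runs without a network). Every connection object you create occupies one of R's limited connection slots until it is close()d:

```r
#Leaky version: the connection object is created but never closed,
#so each call permanently occupies one connection slot
leaky_read <- function(txt) {
  con <- textConnection(txt)
  readLines(con)
}

#Safe version: on.exit() guarantees close() runs when the
#function exits, even if readLines() throws an error
safe_read <- function(txt) {
  con <- textConnection(txt)
  on.exit(close(con))
  readLines(con)
}

before <- nrow(showConnections())
invisible(lapply(1:5, function(i) leaky_read("a\nb")))
leaked <- nrow(showConnections()) - before #5 connections left open
closeAllConnections() #emergency cleanup of the leaked ones

invisible(lapply(1:5, function(i) safe_read("a\nb")))
after_safe <- nrow(showConnections()) #back to 0: nothing leaked
```

Note that readLines() called with a plain character string manages (and closes) its own connection, but the moment you create the connection yourself, as url() does in the rewritten urlmode() further down, closing it becomes your responsibility.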

I think the comments in the code should be helpful, so have a look below:

#Updated URL mode function with better 
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
  
  #Check if URL is good to go
  if(!httr::http_error(x)){
    
    #Test cases
    #x <- "www.pixilink.com/3"
    #x <- "https://www.pixilink.com/93320"
    #x <- "https://www.pixilink.com/93313"
    
    #Then since there are redirect shenanigans
    #Get the actual URL the input points to
    #It should just be the input URL if there is
    #no redirection
    #This is important as this also takes care of
    #checking whether http or https need to be prefixed
    #in case the input URL is supplied without those
    #(this can cause problems for url() below)
    myx <- httr::HEAD(x)$url
    
    #Then check for what the default mode is
    mycon <- url(myx)
    open(mycon, "r")
    mycontent <- readLines(mycon)
    
    mypos <- grep("initial_mode = ", mycontent)
    
    #Close the connection since it's no longer
    #necessary
    close(mycon)
    
    #Some URLs with weird formats can return 
    #empty on this one since they don't
    #follow the expected format.
    #See for example: "https://www.pixilink.com/clients/899/#3"
    #which is actually
    #redirected from "https://www.pixilink.com/3"
    #After that, evaluate what's at mypos, and always 
    #return the actual URL
    #along with the result
    if(!purrr::is_empty(mypos)){
      
      #mystr<- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\s\=).*")
      mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
      return(c(myx, mystr))
      #return(mystr)
      
      #So once all that is done, check if the line at mypos
      #contains a 0 (picture), tour (video)
      #if(grepl("0", mycontent[mypos])){
      #  return(c(myx, "picture"))
        #return("picture")
      #} else if(grepl("tour", mycontent[mypos])){
      #  return(c(myx, "video"))
        #return("video")
      #}
      
    } else{
      #Valid URL but not interpretable
      return(c(myx, "uninterpretable"))
      #return("uninterpretable")
    }
    
  } else{
    #Straight up invalid URL
    #No myx variable to return here
    #Just x
    return(c(x, "invalid"))
    #return("invalid")
  }
  
}


#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)


#All future + progressr related stuff
#learned courtesy 
#
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)

#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))


#Website's base URL
baseurl <- "https://www.pixilink.com"

#Using future_lapply() to recursively apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar

#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
  myprog <- progressr::progressor(along = range)
  sitetype <- do.call(rbind, future_lapply(range, function(b, x){
    myprog() ##Progress bar signaller
    myurl <- paste0(b, "/", x)
    cat("\n", myurl, " ")
    myret <- urlmode(myurl)
    cat(myret, "\n")
    return(c(myurl, myret))
  }, b = baseurl, future.chunk.size = 10))
  
})




#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")

#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")


head(sitetype)
#                        given_url                     actual_url        mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310     invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311     invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313     picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315        tour

unique(sitetype$mode)
# [1] "invalid"     "floorplan2d" "picture"     "tour" 

#--------
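One housekeeping detail the snippet above omits: the cluster created with parallel::makeCluster() is never released, so the worker processes linger after the run. Here is a minimal, self-contained sketch of the create/use/teardown cycle (a hypothetical 2-worker cluster, independent of the code above):

```r
library(parallel)

#Spin up a small socket cluster (2 workers for illustration)
clust <- parallel::makeCluster(2)

#Do some parallel work on it
res <- parallel::parSapply(clust, 1:4, function(x) x^2)

#Release the worker processes once done
parallel::stopCluster(clust)
```

If you are also using future as above, calling future::plan("sequential") after stopCluster() restores single-threaded execution.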

Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity and URL redirection, and also "intelligently" extracts the value assigned to initial_mode. With the help of future_lapply() and the progress bar from the progressr package, this can now be applied quite conveniently in parallel to a sequence of pixilink.com/<integer> URLs. After a bit of wrangling, the results can be presented very neatly as a data.frame, as shown.
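To illustrate that extraction step in isolation, here is the same lookaround pattern applied to a hypothetical line of page source (shown with base R's regmatches() so it runs without stringr; stringr::str_extract() with the same pattern behaves the same way):

```r
#A hypothetical line as it might appear in the page source
line <- "var initial_mode = 'tour';"

#Grab everything between the single quotes using
#PCRE lookbehind/lookahead, as str_extract() does above
mymode <- regmatches(line, regexpr("(?<=').*(?=')", line, perl = TRUE))
mymode
#[1] "tour"
```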

I have demonstrated this in the code above with a small range. Note the commented-out 1:10000 range in the code in this context: I let this code run over the last couple of hours on this (hopefully sufficiently) large range of URLs to check for errors and problems. I can attest that I encountered no errors (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). To prove this, I wrote the data from all 10000 URLs to a CSV file, which I can provide upon request (I didn't want to needlessly upload it to pastebin or elsewhere). Due to an oversight on my part, I forgot to benchmark the run, but I suppose I could do that later if performance metrics are desired or would be considered interesting.

For your purposes, I believe you can simply take this entire snippet and run it verbatim (or with modifications), just changing the range assignment right before with_progress() to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch() messes to deal with).