R: sorting URLs in a list depending on whether or not they exist

I'm working on a project that collects some data from https://www.hockey-reference.com/boxscores/. Specifically, I'm trying to get every table of a season. I've generated a list of URLs by combining https://www.hockey-reference.com/boxscores/ with each date of the calendar and each team name, like "https://www.hockey-reference.com/boxscores/20171005WSH.html".

I've stored every URL in a list, but some of them lead to a 404 error. I'm trying to use the RCurl package with the function url.exists to find out which URLs would return a 404 and remove them from the list. The problem is that url.exists returns FALSE for every URL in the list (including the ones that really exist) inside the for loop. I've also tried calling url.exists(my_list[i]) directly in the console, but it still returns FALSE.

Here is my code:

library(rvest)
library(RCurl)
##### Variables ####
team_names = c("ANA","ARI","BOS","BUF","CAR","CGY","CHI","CBJ","COL","DAL","DET","EDM","FLA","LAK","MIN","MTL","NSH","NJD","NYI","NYR","OTT","PHI","PHX","PIT","SJS","STL","TBL","TOR","VAN","VGK","WPG","WSH")
S2017 = read.table(file = "2018_season", header = TRUE, sep = ",")
dates = as.character(S2017[,1])
#### format the dates ####
for (i in 1:length(dates)) {
  dates[i] = gsub("-", "", dates[i])
}
dates = unique(dates)
##### generate the URLs ####
url_list = c()
for (j in 1:2) { #dates
  for (k in 1:length(team_names)) {
    print(k)
    url_site = paste("https://www.hockey-reference.com/boxscores/",dates[j],team_names[k],".html",sep="")
    url_list = rbind(url_site,url_list)
  }
}
url_list_raffined = c()
for (l in 1:40) {
  print(l)
  if (url.exists(url_list[l], .header = TRUE) == TRUE) {
    url_list_raffined = c(url_list_raffined,url_list[l])
  }
}

Any ideas about my problem?

Thanks

You can use the httr package instead of RCurl:

library(httr)
library(rvest)
library(xml2)
resp <- httr::GET(url_address, httr::timeout(60))
if (resp$status_code == 200) {
    html <- xml2::read_html(resp)
    txt <- rvest::html_text(html) # or html_nodes(html, "table") etc. to target specific elements
    # save the results somewhere or do your operations..
}

Here url_address is the address you want to download. You may need to put this in a function or a loop to iterate over all the addresses.
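For example, a minimal sketch of that loop applied to your situation (assuming url_list is the vector built in your question): a HEAD request is enough to check whether a page exists without downloading its body, and tryCatch keeps one network error from stopping the whole loop.

library(httr)

# Keep only the URLs that respond with HTTP 200.
# `url_list` is assumed to be the vector of box-score URLs from the question.
url_list_raffined <- c()
for (url_address in url_list) {
  resp <- tryCatch(httr::HEAD(url_address, httr::timeout(60)),
                   error = function(e) NULL)
  if (!is.null(resp) && httr::status_code(resp) == 200) {
    url_list_raffined <- c(url_list_raffined, url_address)
  }
}

Note this sends one request per URL, so for a full season you may want to add a short Sys.sleep() between requests to avoid hammering the site.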