Using R to access FTP Server and Download Files Results in Status "530 Not logged in"

What I am trying to do

I am trying to download a number of weather data files from the US National Climatic Data Center's FTP server, but I am running into problems with an error message after successfully completing several file downloads.

After successfully downloading a couple of station/year combinations, I start getting the error message "530 Not logged in". I have tried starting the run at the offending year and get roughly the same result: it downloads a year or two of data and then stops with the same error message about not being logged in.

Working example

Here is a working example (or not, as it turns out), with the output truncated and pasted below.

options(timeout = 300)
ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/"
td <- tempdir()
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999", "984260-41231", "984290-99999", "984300-99999", "984320-99999", "984330-99999")
years <- 1960:2016

for (i in years) {
  remote_file_list <- RCurl::getURL(
    paste0(ftp, "/", i, "/"), ftp.use.epsv = FALSE, ftplistonly = TRUE,
    crlf = TRUE, ssl.verifypeer = FALSE)
  remote_file_list <- strsplit(remote_file_list, "\r*\n")[[1]]

  file_list <- paste0(station, "-", i, ".op.gz")

  file_list <- file_list[file_list %in% remote_file_list]

  file_list <- paste0(ftp, i, "/", file_list)

  Map(function(ftp, dest) utils::download.file(url = ftp,
                                               destfile = dest, mode = "wb"),
      file_list, file.path(td, basename(file_list)))
}


trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1960/983250-99999-1960.op.gz'
Content type 'unknown' length 7135 bytes
==================================================
downloaded 7135 bytes

...

trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1961/984290-99999-1961.op.gz'
Content type 'unknown' length 7649 bytes
==================================================
downloaded 7649 bytes

trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz'
downloaded 0 bytes

 Error in utils::download.file(url = ftp, destfile = dest, mode = "wb") : 
 cannot download all files In addition: Warning message: 
 In utils::download.file(url = ftp, destfile = dest, mode = "wb") : 
 URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz':
 status was '530 Not logged in'

我尝试过但尚未成功的不同方法和想法

So far I have tried slowing the requests down with Sys.sleep in the for loop, and every other way of retrieving the files more slowly that I could think of, such as opening and then closing the connection, etc. It is puzzling because: i) it works for a while and then stops, and the failure is not tied to a particular year/station combination; ii) I can use nearly identical code to download the much larger annual global weather data files over a similarly long stretch without any errors; and iii) it does not always stop when going from 1961 to 1962; sometimes it stops at 1960 when the run starts at 1961, and so on, but as far as I can tell it consistently fails between years rather than within a year.
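
For reference, the request pacing mentioned above was roughly of the following form; this is only a sketch, and the 5-second delay is an arbitrary value that did not prevent the error:

# reuses file_list and td from the working example above
Map(function(u, dest) {
  out <- utils::download.file(url = u, destfile = dest, mode = "wb")
  Sys.sleep(5)  # arbitrary pause between requests to ease server load
  out
}, file_list, file.path(td, basename(file_list)))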

The login is anonymous, but you can supply userpwd "ftp:your@email.address". So far I have not managed to use that method successfully to make sure I am logged in when downloading the station files.
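
An attempt along those lines looks roughly like the following sketch; the RCurl::getBinaryURL call is illustrative rather than the exact code I ran, and the userpwd value is the placeholder format mentioned above, not a real credential:

u <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz"
# pass the anonymous credentials explicitly as a curl option
bin <- RCurl::getBinaryURL(u, userpwd = "ftp:your@email.address",
                           ftp.use.epsv = FALSE)
writeBin(bin, file.path(td, basename(u)))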

I think you need a more defensive strategy when working with this FTP server:

library(curl)  # ++gd > RCurl
library(purrr) # consistent "data first" functional & piping idioms FTW
library(dplyr) # progress bar

# We'll use this to fill in the years
ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"

dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
                              ssl_verifypeer=FALSE, ftp_response_timeout=30)

# Since you, yourself, noted the server was perhaps behaving strangely or under load
# it's prbly a much better idea (and a practice of good netizenship) to cache the
# results somewhere predictable rather than a temporary, ephemeral directory
cache_dir <- "./gsod_cache"
dir.create(cache_dir, showWarnings=FALSE)

# Given the sporadic efficacy of server connection, we'll wrap our calls
# in safe & retry functions. Change this variable if you want to have it retry
# more times.
MAX_RETRIES <- 6

# Wrapping the memory fetcher (for dir listings)
s_curl_fetch_memory <- safely(curl_fetch_memory)
retry_cfm <- function(url, handle) {

  i <- 0
  repeat {
    i <- i + 1
    res <- s_curl_fetch_memory(url, handle=handle)
    if (!is.null(res$result)) return(res$result)
    if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
  }

}

# Wrapping the disk writer (for the actual files)
# Note the use of the cache dir. It won't waste your bandwidth or the
# server's bandwidth or CPU if the file has already been retrieved.
s_curl_fetch_disk <- safely(curl_fetch_disk)
retry_cfd <- function(url, path) {

  # you should prbly be a bit more thorough than `basename` since
  # i think there are issues with the 1971 and 1972 filenames. 
  # Gotta leave some work up to the OP
  cache_file <- sprintf("%s/%s", cache_dir, basename(url))
  if (file.exists(cache_file)) return()

  i <- 0
  repeat {
    i <- i + 1
    res <- s_curl_fetch_disk(url, cache_file)
    if (!is.null(res$result)) return()
    if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
  }

}

# the stations and years
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
             "984260-41231", "984290-99999", "984300-99999", "984320-99999",
             "984330-99999")
years <- 1960:2016

# progress indicators are like bowties: cool
pb <- progress_estimated(length(years))
walk(years, function(yr) {

  # the year we're working on
  year_url <- sprintf(ftp_base, yr)

  # fetch the directory listing
  tmp <- retry_cfm(year_url, handle=dir_list_handle)
  con <- rawConnection(tmp$content)
  fils <- readLines(con)
  close(con)

  # sift out only the target stations
  map(station, ~grep(., fils, value=TRUE)) %>%
    keep(~length(.)>0) %>%
    flatten_chr() -> fils

  # grab the stations files
  walk(paste(year_url, fils, sep=""), retry_cfd)

  # tick off progress
  pb$tick()$print()

})

You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.
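
Because retry_cfd skips anything already present in cache_dir, an interrupted run can simply be restarted and will only fetch the files that are still missing. As a small usage note, a quick way to see how much has been cached so far:

length(list.files(cache_dir, pattern = "\\.op\\.gz$"))  # count of station files fetched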