R Cannot download a file from the web

I can download a file from this site in my browser: https://www.cmegroup.com/ftp/pub/settle/comex_future.csv

But when I try the following:

url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"

dest <- "C:/COMEXfut.csv"

download.file(url, dest)

I get the following error message:

Error in download.file(url, dest) : 
  cannot open URL 'https://www.cmegroup.com/ftp/pub/settle/comex_future.csv'
In addition: Warning message:
In download.file(url, dest) :
  InternetOpenUrl failed: 'The operation timed out'

This happens even if I raise the timeout first:

options(timeout = max(600, getOption("timeout")))

Any idea why this happens? Thanks!

The problem here is that the site you are downloading from requires some extra headers. The easiest way to supply them is with httr:

library(httr)

url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
UA <- paste('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0)',
            'Gecko/20100101 Firefox/98.0')

res <- GET(url, add_headers(`User-Agent` = UA, Connection = 'keep-alive'))

The download takes less than a second.
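Before using `res`, it may be worth confirming that the server actually returned the file rather than an error page. A minimal sketch, assuming httr is installed; `fetch_with_ua` and its `seconds` argument are hypothetical names introduced here, not part of httr:

```r
library(httr)

# Hypothetical helper: request `url` with a browser-like User-Agent and
# raise an R error if the server does not answer with a success status.
fetch_with_ua <- function(url, ua, seconds = 60) {
  res <- GET(
    url,
    add_headers(`User-Agent` = ua, Connection = "keep-alive"),
    timeout(seconds)      # give up if the request takes longer than this
  )
  stop_for_status(res)    # converts HTTP 4xx/5xx responses into errors
  res
}
```

With the `url` and `UA` defined above, `res <- fetch_with_ua(url, UA)` should behave like the plain `GET()` call, except that a blocked or failed request stops with an error instead of passing silently.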

If you want to save the file, you can do:

writeBin(res$content, 'myfile.csv')
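As an alternative to `writeBin()`, httr can stream the body straight to disk with `write_disk()`. A minimal sketch, assuming httr is installed; `save_with_ua` is a hypothetical helper name, and the destination path is whatever you choose:

```r
library(httr)

# Hypothetical helper: write the response body directly to `dest`
# instead of holding it in memory first.
save_with_ua <- function(url, dest, ua) {
  res <- GET(
    url,
    add_headers(`User-Agent` = ua),
    write_disk(dest, overwrite = TRUE)  # replace an existing file at `dest`
  )
  stop_for_status(res)  # error out if the server returned 4xx/5xx
  invisible(dest)
}
```

Note that if the request fails, `dest` may still contain the server's error page, so check for the error before reading the file.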

Or, if you just want to read the data straight into R without even saving it, you can do:

content(res)
#> Rows: 527 Columns: 20
#> -- Column specification ----------------------------------------------------------------
#> Delimiter: ","
#> chr (10): PRODUCT SYMBOL, CONTRACT MONTH, CONTRACT DAY, CONTRACT, PRODUCT DESCRIPTIO...
#> dbl (10): CONTRACT YEAR, OPEN, HIGH, LOW, LAST, SETTLE, EST. VOL, PRIOR SETTLE, PRIO...
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 527 x 20
#>    `PRODUCT SYMBOL` `CONTRACT MONTH` `CONTRACT YEAR` `CONTRACT DAY` CONTRACT
#>    <chr>            <chr>                      <dbl> <chr>          <chr>   
#>  1 0GC              07                          2022 NA             0GCN22  
#>  2 4GC              03                          2022 NA             4GCH22  
#>  3 4GC              05                          2022 NA             4GCK22  
#>  4 4GC              06                          2022 NA             4GCM22  
#>  5 4GC              08                          2022 NA             4GCQ22  
#>  6 4GC              10                          2022 NA             4GCV22  
#>  7 4GC              12                          2022 NA             4GCZ22  
#>  8 4GC              02                          2023 NA             4GCG23  
#>  9 4GC              04                          2023 NA             4GCJ23  
#> 10 4GC              06                          2023 NA             4GCM23  
#> # ... with 517 more rows, and 15 more variables: PRODUCT DESCRIPTION <chr>, OPEN <dbl>,
#> #   HIGH <dbl>, HIGH AB INDICATOR <chr>, LOW <dbl>, LOW AB INDICATOR <chr>, LAST <dbl>,
#> #   LAST AB INDICATOR <chr>, SETTLE <dbl>, PT CHG <chr>, EST. VOL <dbl>,
#> #   PRIOR SETTLE <dbl>, PRIOR VOL <dbl>, PRIOR INT <dbl>, TRADEDATE <chr>
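If you would rather stay with base R's `download.file()`, one thing to try is changing the user agent it sends, which is taken from the `HTTPUserAgent` option. This is a sketch only: it has not been checked against this particular site, which may require more than a User-Agent header, and `download_with_ua` is a hypothetical helper name.

```r
# Hypothetical helper: temporarily set the user agent that download.file()
# sends, then restore the previous option value when the function exits.
download_with_ua <- function(url, dest, ua) {
  old <- options(HTTPUserAgent = ua)
  on.exit(options(old), add = TRUE)
  # mode = "wb" avoids line-ending translation on Windows
  download.file(url, dest, mode = "wb")
}
```

Usage would be `download_with_ua(url, "C:/COMEXfut.csv", UA)`, reusing the `url` and `UA` values from the httr example.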