How to set the right RCurl options to download from the NSE website

Question

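I am trying to download files from the NSE India website (nseindia.com). The problem is that the site's administrators do not want scraping programs downloading files or reading pages from the site. They appear to restrict access based on the user agent.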

The file I am trying to download is http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip

I can download it from the Linux shell using

curl -v -A "Mozilla" http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip

The output looks like this:

* About to connect() to www.nseindia.com port 80 (#0)
*   Trying 115.112.4.12...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* connected
> GET /archives/equities/bhavcopy/pr/PR280815.zip HTTP/1.1
> User-Agent: Mozilla
> Host: www.nseindia.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: Oracle-iPlanet-Web-Server/7.0
< Content-Length: 374691
< X-frame-options: SAMEORIGIN
< Last-Modified: Fri, 28 Aug 2015 12:20:02 GMT
< ETag: "5b7a3-55e051f2"
< Accept-Ranges: bytes
< Content-Type: application/zip
< Date: Sat, 29 Aug 2015 17:56:05 GMT
< Connection: keep-alive
<
{ [data not shown]
PK
  5  365k    5 19977    0     0  34013      0  0:00:11 --:--:--  0:00:11 56592

So I am able to download the file this way.

The code I am using with RCurl is this:

  library("RCurl")

  jurl <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
  juseragent <- "Mozilla"
  myOpts = curlOptions(verbose = TRUE, header = TRUE, useragent = juseragent)
  jfile <- getURL(jurl,.opts=myOpts)

This, however, does not work.

I have also tried download.file from the base library with a changed user agent, without success.

Any help would be appreciated.

Answer 1

library(curl) # this is the curl package, not RCurl; install it first if you do not have it

Download the file into the working directory:

curl_download("http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip","tt.zip",handle = new_handle("useragent" = "my_user_agent"))

Answer 2

First, your problem is not setting the user agent; it is that you are downloading binary data. This works:

jfile <- getURLContent(jurl, .opts=myOpts, binary=TRUE)
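To illustrate why the binary flag matters, here is a minimal offline sketch. It uses a few made-up bytes as a stand-in for the downloaded content (no request is made), and is_zip is a hypothetical helper, not part of RCurl. It shows that bytes containing NULs cannot be turned into an R character string, while a raw vector can be written to disk unchanged with writeBin.

```r
# Hypothetical helper (not part of RCurl): TRUE if a raw vector starts
# with the local-file-header signature of a zip archive, "PK\003\004".
is_zip <- function(bytes) {
  is.raw(bytes) && length(bytes) >= 4 &&
    identical(bytes[1:4], as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Stand-in for what getURLContent(jurl, .opts = myOpts, binary = TRUE) returns:
# made-up bytes that begin like a zip file and contain NUL bytes.
jfile <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x00, 0x00))

# Binary data cannot be held in a character string: rawToChar() rejects
# the embedded NUL bytes, so as_text ends up NULL here.
as_text <- tryCatch(rawToChar(jfile), error = function(e) NULL)

# A raw vector round-trips through writeBin()/readBin() unchanged.
tf <- tempfile(fileext = ".zip")
writeBin(jfile, tf)
on_disk <- readBin(tf, what = "raw", n = file.size(tf))
```

With a real download, the same writeBin(jfile, tf) call would save the archive, and is_zip(jfile) would be a quick way to tell a zip file from an HTML error page.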

Here is a (more) complete example using httr instead of RCurl.

library(httr)
url <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
response <- GET(url, user_agent("Mozilla"))
status_code(response)                                    # 200 OK
# [1] 200
tf <- tempfile()
writeBin(content(response, "raw"), tf)                   # write response content (the zip file) to a temporary file
files <- unzip(tf, exdir=tempdir())                      # unzips to system temp directory and returns a vector of file names
df.lst <- lapply(files[grepl("\\.csv$", files)], read.csv) # convert .csv files to list of data.frames
head(df.lst[[2]])
#      SYMBOL SERIES                  SECURITY HIGH.LOW INDEX.FLAG
# 1 AGRODUTCH     EQ AGRO DUTCH INDUSTRIES LTD        H         NA
# 2    ALLSEC     EQ   ALLSEC TECHNOLOGIES LTD        H         NA
# 3      ALPA     BE     ALPA LABORATORIES LTD        H         NA
# 4      AMTL     EQ     ADV METERING TECH LTD        H         NA
# 5  ANIKINDS     BE       ANIK INDUSTRIES LTD        H         NA
# 6   ARSHIYA     EQ           ARSHIYA LIMITED        H         NA
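
Picking a data frame by position, as in df.lst[[2]], depends on the order of the files in the archive. As a possible variation, the sketch below names each list element after its file so it can be looked up by name. The directory, file names and contents are made up for illustration and stand in for what unzip() would return.

```r
# Stand-in for the vector returned by unzip(tf, exdir = tempdir()):
# two small CSV files and one non-CSV file, all with made-up names and contents.
out_dir <- file.path(tempdir(), "bhavcopy_sketch")
dir.create(out_dir, showWarnings = FALSE)
files <- file.path(out_dir, c("a.csv", "b.csv", "readme.txt"))
write.csv(data.frame(SYMBOL = c("AAA", "BBB"), SERIES = c("EQ", "BE")),
          files[1], row.names = FALSE)
write.csv(data.frame(SYMBOL = "CCC", SERIES = "EQ"),
          files[2], row.names = FALSE)
writeLines("not a csv", files[3])

# Keep only the .csv files and name each data frame after its file.
csv_files <- files[grepl("\\.csv$", files)]
df.lst <- setNames(lapply(csv_files, read.csv), basename(csv_files))

names(df.lst)            # the file names, usable as list keys
head(df.lst[["b.csv"]])  # look up one table by file name instead of position
```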
