How to set the right RCurl options to download from NSE website
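I am trying to download a file from the NSE India website (nseindia.com). The problem is that the site's administrators do not like scrapers downloading files or reading pages, and they appear to restrict access based on the user agent.
The file I am trying to download is http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
I can download it from the Linux shell using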
curl -v -A "Mozilla" http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
The output looks like this:
* About to connect() to www.nseindia.com port 80 (#0)
*   Trying 115.112.4.12...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0connected
> GET /archives/equities/bhavcopy/pr/PR280815.zip HTTP/1.1
> User-Agent: Mozilla
> Host: www.nseindia.com
> Accept: */*
< HTTP/1.1 200 OK
< Server: Oracle-iPlanet-Web-Server/7.0
< Content-Length: 374691
< X-frame-options: SAMEORIGIN
< Last-Modified: Fri, 28 Aug 2015 12:20:02 GMT
< ETag: "5b7a3-55e051f2"
< Accept-Ranges: bytes
< Content-Type: application/zip
< Date: Sat, 29 Aug 2015 17:56:05 GMT
< Connection: keep-alive
{ [data not shown]
PK
  5  365k    5 19977    0     0  34013      0  0:00:11 --:--:--  0:00:11 56592
So the file downloads fine this way.
The code I am using with RCurl is:
library("RCurl")
jurl <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
juseragent <- "Mozilla"
myOpts = curlOptions(verbose = TRUE, header = TRUE, useragent = juseragent)
jfile <- getURL(jurl,.opts=myOpts)
This does not work.
I have also tried download.file from base R with a changed user agent, but without success.
Any help would be appreciated.
library(curl) # the curl package, not RCurl; install it first with install.packages("curl")
Download the file into the working directory:
curl_download("http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip","tt.zip",handle = new_handle("useragent" = "my_user_agent"))
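As a quick sanity check after the download, one can confirm that the saved file really is a zip archive and not an HTML error page. The curl output in the question shows the body starting with PK, which is the zip signature. The helper name is_zip_file below is hypothetical (it is not part of the curl package). This is only a minimal base R sketch, and the demo writes made-up bytes to temporary files instead of touching the network.

```r
# Hypothetical helper, base R only: TRUE if the file starts with the
# zip local-file-header signature "PK\003\004".
# Note: an empty archive uses a different signature, so this sketch
# would report FALSE for it.
is_zip_file <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 4L)
  identical(magic, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Offline demo with made-up contents (no download involved)
fake_zip <- tempfile(fileext = ".zip")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)), fake_zip)

fake_html <- tempfile(fileext = ".zip")
writeLines("<html><body>Access denied</body></html>", fake_html)

is_zip_file(fake_zip)
is_zip_file(fake_html)
```

After the curl_download() call above, is_zip_file("tt.zip") would be the corresponding check.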
First, your problem is not setting the user agent but downloading binary data. This works:
jfile <- getURLContent(jurl, .opts=myOpts, binary=TRUE)
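To get the archive onto disk with RCurl, the raw vector returned by getURLContent() still has to be written out. Below is a minimal sketch, assuming the server keeps answering as it does in the question. The helper name save_binary_url and the destination file name are hypothetical placeholders, not RCurl functions. The sketch leaves out header = TRUE because libcurl's header option includes the response headers in the body output, which is not wanted in a file that should hold only the zip bytes.

```r
# Hypothetical helper (not part of RCurl); needs the RCurl package installed.
save_binary_url <- function(url, destfile, useragent = "Mozilla") {
  opts <- RCurl::curlOptions(useragent = useragent)
  # binary = TRUE asks getURLContent() for a raw vector rather than text
  bytes <- RCurl::getURLContent(url, .opts = opts, binary = TRUE)
  stopifnot(is.raw(bytes))
  writeBin(bytes, destfile)
  invisible(destfile)
}

# Usage (performs a real HTTP request, so it is left commented out):
# save_binary_url(jurl, "PR280815.zip")
```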
Here is a (more) complete example using httr rather than RCurl.
library(httr)
url <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
response <- GET(url, user_agent("Mozilla"))
status_code(response) # 200 means OK
# [1] 200
tf <- tempfile()
writeBin(content(response, "raw"), tf) # write response content (the zip file) to a temporary file
files <- unzip(tf, exdir=tempdir()) # unzips to system temp directory and returns a vector of file names
df.lst <- lapply(files[grepl("\\.csv$", files)], read.csv) # read each .csv file into a list of data.frames
head(df.lst[[2]])
#      SYMBOL SERIES                  SECURITY HIGH.LOW INDEX.FLAG
# 1 AGRODUTCH     EQ AGRO DUTCH INDUSTRIES LTD        H         NA
# 2    ALLSEC     EQ   ALLSEC TECHNOLOGIES LTD        H         NA
# 3      ALPA     BE     ALPA LABORATORIES LTD        H         NA
# 4      AMTL     EQ     ADV METERING TECH LTD        H         NA
# 5  ANIKINDS     BE       ANIK INDUSTRIES LTD        H         NA
# 6   ARSHIYA     EQ           ARSHIYA LIMITED        H         NA
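Picking a data frame by position, as in df.lst[[2]], depends on the order in which unzip() happens to list the files. One possible variation, sketched here with made-up file names (the real names inside the archive are not shown above), is to name the list by file name. The sketch also shows the filter pattern on its own: in an R string the dot is escaped with a double backslash so that only a literal .csv extension matches. ignore.case = TRUE is an optional extra, added on the assumption that the extension may also appear in upper case.

```r
# Made-up paths standing in for the value returned by unzip()
files <- c("/tmp/demo/first.csv", "/tmp/demo/notes.txt",
           "/tmp/demo/SECOND.CSV", "/tmp/demo/csv_readme.doc")

# "\\.csv$" matches a literal ".csv" at the end of the name
csv_files <- files[grepl("\\.csv$", files, ignore.case = TRUE)]
names(csv_files) <- basename(csv_files)
csv_files

# With real files on disk, lapply() keeps the names, so a table can be
# selected by file name instead of by position:
# df.lst <- lapply(csv_files, read.csv)
# head(df.lst[["first.csv"]])
```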