Scrape a page in R using rvest or RCurl or httr
I want to extract the table from the page below:
https://www.mcxindia.com/market-data/spot-market-price
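I have tried both rvest and RCurl, but in both cases the downloaded page differs from what I see in the browser. I assume there is some form of redirection that I cannot detect or follow.
Any help would be appreciated.
PS: I am not interested in PhantomJS.
Here is what I have tried so far: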
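1. httr
library(httr)
library(XML)
base_url <- "https://www.mcxindia.com/market-data/spot-market-price"
# The User-Agent and Accept values are separate headers, so they are passed separately
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
accept <- "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
doc <- POST(base_url, user_agent(ua), add_headers(Accept = accept), set_cookies(`_ga` = "GA1.2.543290785.1505100652", `_gid` = "GA1.2.1409943545.1505881384", `_gat` = "1"))
# Parse the response body as text rather than passing the response object itself
doc <- htmlParse(content(doc, as = "text"), asText = TRUE)
poptable <- readHTMLTable(doc, which = 7)
Result: No data found!!!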
2. RCurl
library(RCurl)
library(XML)
curl <- getCurlHandle()
# The Accept value goes in httpheader, not inside the user agent string
curlSetOpt(curl = curl,
           ssl.verifypeer = FALSE,
           useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
           httpheader = c(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
           timeout = 60,
           followlocation = TRUE,
           cookiejar = "./cookies",
           cookiefile = "./cookies")
newDoc <- getURL("https://www.mcxindia.com/market-data/spot-market-price", curl = curl)
newDoc <- htmlParse(newDoc, asText = TRUE)
poptable <- readHTMLTable(newDoc, which = 7)
Result: No data found!!!
I would also like to know how to get the Excel file (see the small Excel icon).
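Here is the answer. Judging by the script the code targets, the spot prices are not in the static HTML as a table; they are embedded in a <script> block as a JavaScript variable named vSMP, so the code evaluates that script with V8 and reads the variable back into R:
library(rvest)
library(stringi)
library(V8)
# Embedded JavaScript engine used to evaluate the page's script
ctx <- v8()
pg <- read_html("https://www.mcxindia.com/market-data/spot-market-price")
# Take the first <script> containing 'Data', unescape it, strip backslashes, then evaluate it
html_nodes(pg, xpath=".//script[contains(., 'Data')]")[[1]] %>%
  html_text() %>% stri_unescape_unicode() %>% stri_replace_all_fixed('\\', '') %>%
  ctx$eval() -> ignore_the_blank_return_value
# vSMP is now defined in the V8 context; keep only the columns of interest
data <- ctx$get("vSMP")$Data[,c("Symbol","TodaysSpotPrice","Unit")]
Enjoy!!!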
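To see why this approach can work where readHTMLTable() finds nothing, here is a minimal offline sketch of the same technique. It does not contact the real site: the HTML string, the two rows, and their values are invented for illustration, and only the variable name vSMP and the column names are borrowed from the answer above. It assumes the rvest and V8 packages are installed.

```r
library(rvest)
library(V8)

# Toy stand-in for the real page (hypothetical content): the rows live in a
# JavaScript variable inside a <script> tag, not in an HTML <table>.
html <- '<html><body>
<div id="tbl"></div>
<script>var vSMP = {"Data": [
  {"Symbol": "GOLD",   "TodaysSpotPrice": 100.5,  "Unit": "10 GRMS"},
  {"Symbol": "SILVER", "TodaysSpotPrice": 200.25, "Unit": "1 KGS"}
]};</script>
</body></html>'

pg <- read_html(html)

# The static HTML contains no <table>, so a table-based scraper has nothing to read.
n_tables <- length(html_nodes(pg, "table"))

# Pull the source of the first script mentioning 'Data' and run it in V8.
js <- html_text(html_nodes(pg, xpath = ".//script[contains(., 'Data')]")[[1]])
ctx <- v8()
ctx$eval(js)

# ctx$get() converts the JavaScript object to R via JSON;
# an array of records comes back as a data frame.
spot <- ctx$get("vSMP")$Data[, c("Symbol", "TodaysSpotPrice", "Unit")]
print(spot)
```

The toy script is already valid JavaScript, so the unescaping and backslash-stripping steps from the answer are left out here; on the real page they appear to be needed because the script text is escaped.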