rvest HTML table scraping returns an empty list
I've had success scraping data from HTML tables with rvest, but for this particular site, http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/, when I run the code
url <- "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/"
rankings <- url %>%
read_html %>%
html_nodes("table") %>%
html_table()
it returns nothing but an empty list. What could be going wrong?
The "problem" with this site is that it dynamically loads a JavaScript file, executes it via a callback mechanism to create the data in JS, and then builds the tables/visualizations from that, so there is no `<table>` in the static HTML that rvest downloads.
One way to get the data is [R]Selenium, but that's problematic for many folks.
Another way is to use your browser's developer tools to inspect the JS requests, use "Copy as cURL" (usually a right-click on the request), and then apply some R-fu to get what you need. Since this returns JavaScript (JSONP), we need to do a bit of processing before the final JSON conversion.
library(jsonlite)
library(curlconverter)
library(httr)
# this is the `Copy as cURL` result, but you can leave it in your clipboard
# and not do this in production. Read the `curlconverter` help for more info
CURL <- "curl 'http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54' -H 'Accept: */*' -H 'Referer: http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 11 May 2016 14:47:09 GMT' -H 'Cache-Control: max-age=0' --compressed"
req <- make_req(straighten(CURL))[[1]]
req
# that makes:
# httr::VERB(verb = "GET", url = "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016",
# httr::add_headers(DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch",
# `Accept-Language` = "en-US,en;q=0.8", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54",
# Accept = "*/*", Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/",
# Connection = "keep-alive", `If-Modified-Since` = "Wed, 11 May 2016 14:47:09 GMT",
# `Cache-Control` = "max-age=0"))
# which we can transform into the following after experimenting
URL <- "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016"
pg <- GET(URL,
          add_headers(
            `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54",
            Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/"))
# now all we need to do is remove the callback
dat_from_json <- fromJSON(gsub("\\)$", "", gsub("^RU3_205_2016\\(", "", content(pg, as="text"))), flatten=FALSE)
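To make the callback-stripping step concrete, here is a minimal, self-contained sketch; the JSONP string below is a made-up stand-in for the real Opta response, not actual data from the feed:

```r
library(jsonlite)

# Hypothetical JSONP payload standing in for the real response
jsonp <- 'RU3_205_2016({"teams":[{"name":"Hurricanes","points":10}]})'

# Peel off the leading `callback(` and the trailing `)` to leave bare JSON
json_txt <- sub("\\)\\s*;?\\s*$", "", sub("^[A-Za-z0-9_]+\\(", "", jsonp))

dat <- fromJSON(json_txt)
dat$teams  # a data frame with columns `name` and `points`
```

The inner `sub()` removes any `identifier(` prefix rather than hard-coding `RU3_205_2016`, so the same two lines work if the callback name changes.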
# we can also try removing the JSON callback, but it will return XML instead of JSON,
# which is fine since we can parse that easily
URL <- "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD"
pg <- GET(URL,
          add_headers(
            `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54",
            Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/"))
xml_doc <- content(pg, as="parsed", encoding="UTF-8")
# but then you have to transform the XML, which I'll leave as an exercise to the OP :-)
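For the curious, that XML-to-data-frame step generally looks like the sketch below. The element and attribute names here are invented for illustration; the real feed's schema may well differ, so inspect `xml_doc` first and adjust the XPath accordingly:

```r
library(xml2)

# Toy XML standing in for the real feed (actual node names will differ)
doc <- read_xml(
  '<SeasonStatistics>
     <Team name="Hurricanes" points="10"/>
     <Team name="Lions" points="8"/>
   </SeasonStatistics>')

# Grab every <Team> node, then pull attributes into columns
teams <- xml_find_all(doc, ".//Team")

rankings <- data.frame(
  team   = xml_attr(teams, "name"),
  points = as.integer(xml_attr(teams, "points")),
  stringsAsFactors = FALSE
)
```

`content(pg, as="parsed")` already returns an `xml2` document, so the same `xml_find_all()`/`xml_attr()` pattern applies directly to `xml_doc`.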