Rvest nodes undetected with html_nodes
I don't understand why selectors fail on some websites when I use rvest.
Example:
library(rvest)
url <- read_html("http://www.cbc.ca/news/politics")
headlines <- url %>%
  html_nodes(".headline") %>%
  html_text()
Another example:
library(RSelenium)
rD <- rsDriver(verbose = FALSE)
rD
remDr <- rD$client
url <- "http://www.cbc.ca/news/politics"
remDr$navigate(url)
remDr$getTitle()
remDr$getCurrentUrl()
webElem <- remDr$findElement(using = "class", value = 'headline')
webElem$getElementAttribute("class")
remDr$close()
rD$server$stop()
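This should be straightforward enough. When I inspect the page structure, the headlines sit under the class headline, with the classes card-content and card-content-top above it, but none of the CSS selector and XPath combinations I tried seem to work.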
CSS selectors may not work in rvest because of an issue in the selectr package (at least on Debian). See this for more information:
https://github.com/sjp/selectr/issues/7
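As far as I know, rvest hands CSS selectors to selectr, which translates them into XPath before the document is queried. A minimal sketch for inspecting that translation directly, assuming the selectr package is installed (the variable name headline_xpath is only illustrative):

```r
library(selectr)

# Translate the CSS selector from the question into an XPath expression.
# If this call errors or returns something unexpected, the problem is in
# the CSS-to-XPath step rather than in the page itself.
headline_xpath <- css_to_xpath(".headline")
print(headline_xpath)
```

If the translation looks wrong on your system, passing an XPath expression straight to html_nodes(xpath = ...) sidesteps selectr.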
Using SelectorGadget and Chrome Developer Tools, I found and identified the 'headlines' on the web page with the XPath below. For more on how to find the right XPath, see here:
https://medium.com/@peterjgensler/functions-with-r-and-rvest-a-laymens-guide-acda42325a77
library('rvest')
library('magrittr')
url <- read_html("http://www.cbc.ca/news/politics")
headlines <- url %>%
html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "pinnableHeadline", " " ))]') %>%
html_text()
headlines[1]
"On Trudeau's 2nd trip to China, time may be ripe to advance free trade"
headlines[2]
"Liberals want to be global leader on open government, but face complaints at home"
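To see what the class-matching XPath above actually selects, here is a small sketch run against inline HTML instead of the live site. The markup is hypothetical and not copied from cbc.ca; only the class name pinnableHeadline comes from the answer. The concat(" ", @class, " ") trick matches a whole class token, so an element with several classes is found while a class that merely starts with the same text is not.

```r
library(rvest)

# Hypothetical markup for illustration only.
page <- read_html('<html><body>
  <h3 class="pinnableHeadline">First story</h3>
  <h3 class="card pinnableHeadline">Second story</h3>
  <h3 class="pinnableHeadlineExtra">Not a headline</h3>
</body></html>')

# Same idea as the XPath in the answer: pad @class with spaces and
# look for the space-delimited token " pinnableHeadline ".
class_xpath <- '//*[contains(concat(" ", @class, " "), " pinnableHeadline ")]'

matched <- page %>%
  html_nodes(xpath = class_xpath) %>%
  html_text()

# Number of matching nodes, then their text.
print(length(matched))
print(matched)
```

Running the same length() check against the real page is a quick way to tell whether a selector matches anything in the HTML that read_html() downloaded.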