How to access a page scraped using RSelenium with rvest?
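I'm trying to scrape a web page that uses angular.js. My understanding is that the only option in R is to load the page with RSelenium first and then parse the content. However, I find rvest more intuitive than RSelenium for parsing, so I'd like to use RSelenium as little as possible and switch to rvest as soon as I can.
So far I've worked out that I probably need RSelenium at least to connect, and then htmlTreeParse to download the HTML. Suppose this is part of my output: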
structure(list(name = "div", attributes = structure(c("im_dialog_date",
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
text = structure(list(name = "text", attributes = NULL, children = NULL,
namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name",
"attributes", "children", "namespace", "namespaceDefinitions",
"value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode",
"XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL,
namespaceDefinitions = NULL), .Names = c("name", "attributes",
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode",
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))
How can I pass this to rvest::read_html()?
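If you look at the class of the object, it is an XMLNode, a class defined by the XML package. That package defines a toString method for it (though, oddly, not an as.character one), which lets you convert the node to a plain string that xml2::read_html can then read: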
library(rvest)
#> Loading required package: xml2
node <- structure(list(name = "div", attributes = structure(c("im_dialog_date",
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
text = structure(list(name = "text", attributes = NULL, children = NULL,
namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name",
"attributes", "children", "namespace", "namespaceDefinitions",
"value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode",
"XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL,
namespaceDefinitions = NULL), .Names = c("name", "attributes",
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode",
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))
node %>% XML::toString.XMLNode() %>% read_html()
#> {xml_document}
#> <html>
#> [1] <body><div class="im_dialog_date" ng-bind="dialogMessage.dateText">6 ...
That said, I usually just use the getPageSource() method of RSelenium::remoteDriver to grab all of the HTML, which is then easy to parse with rvest.
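The getPageSource() route can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names fetch_rendered_html and extract_text are hypothetical, and the host, port, browser, and example URL are assumptions about a Selenium server you already have running. The one detail worth noting is that getPageSource() returns a list, so the HTML string is its first element.

```r
library(rvest)

# Hypothetical helper: load a page in a Selenium-driven browser and return the
# rendered HTML as a single string. Host, port and browser are assumptions.
fetch_rendered_html <- function(url, host = "localhost", port = 4444L,
                                browser = "firefox") {
  remDr <- RSelenium::remoteDriver(remoteServerAddr = host, port = port,
                                   browserName = browser)
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)  # always release the browser session
  remDr$navigate(url)
  # getPageSource() returns a list; its first element is the HTML string
  remDr$getPageSource()[[1]]
}

# Hypothetical helper: parse an HTML string with rvest and return the text of
# every node matching a CSS selector.
extract_text <- function(html, css) {
  page <- read_html(html)
  html_text(html_nodes(page, css))
}

# Usage (needs a running Selenium server, so it is left commented out):
# html <- fetch_rendered_html("https://example.com/")
# extract_text(html, "div.im_dialog_date")
```

Keeping the Selenium step in one small function means everything after it is ordinary rvest code. On an Angular page the content may be rendered asynchronously, so you may need to wait for it to appear before calling getPageSource().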