Rvest scraping webpage content returned from html_text()

Question

I am trying to scrape (dynamic?) content from a webpage using the rvest package. I understand that dynamic content should require a tool such as Selenium or PhantomJS.

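However, my experimentation leads me to believe I should still be able to find the content I want using only the standard web-scraping R packages (rvest, httr, xml2).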

For this example I will be using a Google Maps webpage. Here is the example URL:

https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/

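If you follow the hyperlink above, it will take you to an example webpage. The content I want in this example is the two addresses "920 NC-16, Crumpler, NC 28617" and "2114 NC-16, Newton, NC 28658" in the top-left corner of the page.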

Standard techniques using a CSS selector or XPath did not work, which initially made sense, since I assumed this content is dynamic.

library(rvest)

url <- "https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/"
page <- read_html(url)

# The commands below all return {xml_nodeset (0)}
html_nodes(page, css = ".tactile-searchbox-input")
html_nodes(page, css = "#sb_ifc50 > input")
html_nodes(page, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "tactile-searchbox-input", " " ))]')

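The commands above return "{xml_nodeset (0)}", which I took to be a result of this content being generated dynamically. But here is where my confusion lies: if I convert the whole page to text using html_text(), I can find the addresses in the value returned.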

x <- html_text(read_html(url))
substring <- substr(x, 33561 - 100, 33561 + 300)

Executing the commands above produces a substring with the following value:

"null,null,null,null,[null,null,null,null,null,null,null,[[[\"920 NC-16, Crumpler, NC 28617\",null,null,null,null,null,null,null,null,null,null,\"Nzm5FTtId895YoaYC4wZqUnMsBJ2rlGI\"]\n,[\"2114 NC-16, Newton, NC 28658\",null,null,null,null,null,null,null,null,null,null,\"RIU-FSdWnM8f-IiOQhDwLoMoaMWYNVGI\"]\n]\n,null,null,0,null,[[null,null,null,null,null,null,null,3]\n,[null,null,null,null,[null,null,null,null,nu"

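The substring is very messy, but it contains the content I need. I have heard that parsing webpages with regex is frowned upon, but I cannot think of any other way to obtain this content that would also avoid the use of dynamic scraping tools.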

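If anyone has suggestions for parsing the HTML returned, or can explain why I am unable to find the content using XPath or CSS selectors but can find it by simply parsing the raw HTML text, it would be greatly appreciated.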

Thanks for your time.

Answer 1

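The reason you can't find the text with XPath or CSS selectors is that the string you have found sits inside the contents of a JavaScript array object, not inside an HTML element. You were right to assume that the text elements you see on the screen are loaded dynamically; those elements are not where you are reading the strings from.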
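To illustrate the point, here is a minimal sketch using a made-up miniature page. The HTML string and the variable name `data` are assumptions for illustration, not Google's actual markup. The address exists only inside a script element, so a selector for the input finds nothing, while the script's text still contains the address:

```r
library(rvest)

# Hypothetical stand-in for the real page: the address lives only in a
# JavaScript array inside <script>; no input element exists in the static HTML.
html <- '<html><head><script>
var data = [[["920 NC-16, Crumpler, NC 28617",null,null]]];
</script></head><body><div id="app"></div></body></html>'

page <- read_html(html)

# No such element in the static HTML, so the node set is empty
inputs <- html_nodes(page, css = ".tactile-searchbox-input")
length(inputs)

# The script's text is still part of the document and contains the address
script_text <- html_text(html_nodes(page, css = "script"))
grepl("920 NC-16, Crumpler, NC 28617", script_text, fixed = TRUE)
```

This is also why html_text() on the whole page "sees" the address: it returns the text of every node, script contents included.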

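I don't think there's anything wrong with parsing specific HTML using regex. I would make sure to get the full HTML rather than just the html_text() output, in this case by using the httr package. You can grab the address from the page like this: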

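library(httr)
library(magrittr)  # provides %>%, extract() and extract2()

GetAddressFromGoogleMaps <- function(url)
{
  GET(url)                %>%
  content("text")         %>%   # full response body as one string
  strsplit("spotlight")   %>%
  extract2(1)             %>%
  extract(-1)             %>%   # drop everything before the first "spotlight"
  strsplit("[[]{3}(\")*") %>%   # split on "[[[" plus any following quotes
  extract2(1)             %>%
  extract(2)              %>%
  strsplit("\"")          %>%   # the address ends at the next double quote
  extract2(1)             %>%
  extract(1)
}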

Now:

GetAddressFromGoogleMaps(url)
#[1] "920 NC-16, Crumpler, NC 28617, USA"
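
If you want both addresses at once, a single regular expression is another option. The following is only a sketch under stated assumptions: `sample_text` is a hypothetical, shortened copy of the substring shown in the question (not a live response), and the pattern assumes each address is a double-quoted string ending in a two-letter state code and a five-digit ZIP code.

```r
# Hypothetical sample of the page text, shortened from the substring in the question
sample_text <- '[[["920 NC-16, Crumpler, NC 28617",null,null,"Nzm5FTtId895YoaYC4wZqUnMsBJ2rlGI"]\n,["2114 NC-16, Newton, NC 28658",null,null,"RIU-FSdWnM8f-IiOQhDwLoMoaMWYNVGI"]\n]'

# Assumed address shape: a quoted string ending in ", <two-letter state> <5-digit ZIP>"
pattern <- '"[^"]*, [A-Z]{2} [0-9]{5}"'

# Find every match, then strip the surrounding double quotes
matches <- regmatches(sample_text, gregexpr(pattern, sample_text, perl = TRUE))[[1]]
addresses <- gsub('"', "", matches, fixed = TRUE)
addresses
```

The pattern would need adjusting if the strings carry a suffix such as ", USA", as in the output above, or if the page layout changes.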
