Rvest not recognizing css selector

Question

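I am trying to scrape this website:

http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true

using the rvest package in R.

Unfortunately, rvest does not seem to recognize the nodes through the CSS selector.

For example, if I try to extract the information in the header of each table (Grade, Prize, Distance), whose CSS selector is ".black", and I run this code: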

library(rvest)
URL <- read_html("http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true")
nodes <- html_nodes(URL, ".black")

nodes turns out to be an empty list, so nothing was scraped.

Answer 1

The page makes an XHR request to generate the HTML. Try this (it should also make it easier to automate the data capture):

library(httr)
library(xml2)
library(rvest)

res <- GET("http://www.racingpost.com/greyhounds/result_by_meeting_full.sd",
           query=list(r_date="2015-12-26",
                      meeting_id=18))

doc <- read_html(content(res, as="text"))

html_nodes(doc, ".black")
## {xml_nodeset (56)}
##  [1] <span class="black">A9</span>
##  [2] <span class="black">£61</span>
##  [3] <span class="black">470m</span>
##  [4] <span class="black">-30</span>
##  [5] <span class="black">H2</span>
##  [6] <span class="black">£105</span>
##  [7] <span class="black">470m</span>
##  [8] <span class="black">-30</span>
##  [9] <span class="black">A7</span>
## [10] <span class="black">£61</span>
## [11] <span class="black">470m</span>
## [12] <span class="black">-30</span>
## [13] <span class="black">A5</span>
## [14] <span class="black">£66</span>
## [15] <span class="black">470m</span>
## [16] <span class="black">-30</span>
## [17] <span class="black">A8</span>
## [18] <span class="black">£61</span>
## [19] <span class="black">470m</span>
## [20] <span class="black">-20</span>
## ...
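The output above appears to repeat in groups of four values per race. As a minimal sketch, assuming that grouping holds for every race, the node text could be reshaped into one row per race. The inline HTML below is only a stand-in for the fetched document, and the column names (including `adjustment` for the fourth value, whose meaning the thread does not state) are hypothetical.

```r
library(xml2)
library(rvest)

# Stand-in for `doc` from the request above ("\u00a3" is the pound sign).
sample_html <- '<div>
  <span class="black">A9</span><span class="black">\u00a361</span>
  <span class="black">470m</span><span class="black">-30</span>
  <span class="black">H2</span><span class="black">\u00a3105</span>
  <span class="black">470m</span><span class="black">-30</span>
</div>'
sample_doc <- read_html(sample_html)

# Pull the text out of every .black node.
vals <- html_text(html_nodes(sample_doc, ".black"))

# Assumption: four consecutive values describe one race.
races <- as.data.frame(matrix(vals, ncol = 4, byrow = TRUE),
                       stringsAsFactors = FALSE)
names(races) <- c("grade", "prize", "distance", "adjustment")  # hypothetical names

# Strip the currency symbol and the unit to get numeric columns.
races$prize_gbp  <- as.numeric(gsub("[^0-9.]", "", races$prize))
races$distance_m <- as.numeric(sub("m$", "", races$distance))
```

If the number of nodes is not a multiple of four, `matrix()` will warn, which is a useful signal that the grouping assumption does not hold for that meeting.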

Answer 2

Your selector is fine and rvest is working correctly. The problem is that the content you are looking for is not in the url object.

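If you open the website and use your web browser's inspection tool, you will see that all the data you want is a descendant of <div id="resultMainOutput">. Now, if you look at the page source, you will see (line breaks added for readability):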

<div id="resultMainOutput">
    <div class="wait">
       <img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
    </div>
</div>

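The data you want is loaded dynamically, and rvest cannot handle that. It can only fetch the page source and retrieve whatever is there without any client-side processing.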
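One way to confirm this from R is to inspect the container itself rather than the final selector. The sketch below parses a copy of the static source quoted above (embedded as a string so it runs offline; the variable names are illustrative) and shows that the container exists but holds only the loading placeholder.

```r
library(xml2)
library(rvest)

# Stand-in for the static page source, as delivered before any JavaScript runs.
static_src <- '<html><body>
<div id="resultMainOutput">
    <div class="wait">
       <img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
    </div>
</div>
</body></html>'

page <- read_html(static_src)
container <- html_nodes(page, "#resultMainOutput")

n_container <- length(container)                     # the container is present
n_black <- length(html_nodes(container, ".black"))   # but no result nodes inside it
n_wait  <- length(html_nodes(container, ".wait"))    # only the loading placeholder
```

An empty result for `.black` together with a non-empty result for `.wait` suggests that the content is filled in client-side, not that the selector is wrong.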

The exact same question was asked under the blog post introducing rvest, and this is what the package author had to say:

You have two options for pages like that:

Use the debug console in the web browser to reverse engineer the communications protocol and request the raw data directly from the server.

Use a package like RSelenium to automate a web browser.

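If you do not need to fetch this data repeatedly, or you can accept some manual work each time you run the analysis, the simplest workaround is:

1. Open the website in the web browser of your choice
2. Using the browser's inspection tool, copy the current page content (the whole page, or only the <div id="resultMainOutput"> content)
3. Paste it into a text editor and save it as a new file
4. Run the analysis on that file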

> url <- read_html("/tmp/racingpost.html")
> html_nodes(url, ".black")
# {xml_nodeset (56)}
# [1] <span class="black">A9</span>
# [2] <span class="black">Â£61</span>
# [3] <span class="black">470m</span>
# [4] <span class="black">-30</span>
# (skip the rest)

As you can see, some encoding issues crept in along the way, but they can be fixed later.
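The "Â£" in the output is what a UTF-8 pound sign looks like when its bytes are decoded as Latin-1, so one likely fix is to declare the encoding when reading the file. This is a sketch under the assumption that the file was saved as UTF-8 without a charset declaration; the temporary file below stands in for /tmp/racingpost.html.

```r
library(xml2)
library(rvest)

# Stand-in for the saved page: UTF-8 bytes, no <meta charset> declaration.
# "\u00a3" is the pound sign.
saved_html <- '<html><body><span class="black">\u00a361</span></body></html>'
tmp <- tempfile(fileext = ".html")
writeBin(charToRaw(enc2utf8(saved_html)), tmp)

# Tell the parser how the file is encoded instead of letting it guess.
page <- read_html(tmp, encoding = "UTF-8")
prize <- html_text(html_nodes(page, ".black"))
```

If the editor saved the file in a different encoding, the `encoding` argument would need to name that encoding instead.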
