R web scraping plotly trace hover text without selenium or phantomjs

Question

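I am trying to scrape the hover text from some plotly traces published on the web. I have not done this kind of scraping before, and if possible I would like to do it in R without selenium or phantomjs... maybe using V8? I am wondering if anyone can point me in the right direction. The link to the plots is below. I am specifically looking for the data in the plot for Figure 21: Positivity rate for COVID-19 in Alberta by zone. Thanks!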

https://www.alberta.ca/stats/covid-19-alberta-statistics.htm

Answer 1

Using rvest and jsonlite, the following code will give you the data you need. The data for plotly charts is stored in <script> tags.

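The first step is to identify the widget ID of the figure of interest. The code below shows how to find the widget ID by looking for the caption text of that figure. You can then use html_nodes() and html_attr() to search for the correct node. jsonlite::fromJSON() converts the JSON data into an R list object.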

library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(dplyr)


url <-
  "https://www.alberta.ca/stats/covid-19-alberta-statistics.htm#laboratory-testing"

raw_html <- read_html(url)

# get widget ID

caption <-
  "Figure 21: Positivity rate for COVID-19 in Alberta by zone."

figure_divs <- html_nodes(raw_html, ".figure")

figure_21_div_lgl <- grepl(caption, figure_divs)

widget_id <-
  figure_divs[figure_21_div_lgl] %>%
  html_nodes("div") %>%
  html_attr("id")

# find data for the correct widget_id

data_for <-
  html_nodes(raw_html, "script") %>%
  html_attr("data-for")

data_for_figure_21_lgl <-
  !is.na(data_for) & data_for == widget_id

data_for_figure_21 <-
  html_nodes(raw_html, "script") %>%
  .[data_for_figure_21_lgl] %>%
  html_text()

dff21_l <- fromJSON(data_for_figure_21)
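
The lookup above can be tried offline on a small hand-written page. This is only a minimal sketch: the HTML, the widget ID "htmlwidget-abc123" and the JSON payload are made up for illustration and are not taken from the Alberta page. It also swaps the logical-vector filtering for an XPath attribute match, which is one alternative way to pick the payload for a known widget ID.

```r
library(rvest)
library(jsonlite)

# Hypothetical page: one widget container plus its JSON payload (assumed layout).
page <- read_html('
<div class="figure">
  <div id="htmlwidget-abc123" class="plotly html-widget"></div>
  <p class="caption">Figure 1: Toy plot.</p>
</div>
<script type="application/json" data-for="htmlwidget-abc123">
{"x":{"data":[{"x":[1,2],"y":[3,4],"text":["a: 1","b: 2"]}]}}
</script>')

# Select the <script> whose data-for attribute equals the widget ID.
widget_id <- "htmlwidget-abc123"
payload <- page %>%
  html_node(xpath = sprintf("//script[@data-for='%s']", widget_id)) %>%
  html_text() %>%
  trimws()

# Keep the nested list structure instead of simplifying to data frames.
parsed <- fromJSON(payload, simplifyVector = FALSE)
toy_text <- unlist(parsed$x$data[[1]]$text)
```

Here toy_text holds the hover strings of the first trace. On a real page there may be several traces, so the text fields of all entries in x$data would need to be collected.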

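To extract the data shown in the tooltips (the "hover text"), we need to iterate over the individual elements. First we parse each tooltip into a DOM structure with read_html(). After that we extract the text with html_text(). We iterate over the elements several more times to split and clean the strings, so that we can finally convert the result into a data.frame.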

tooltip_text_raw <- unlist(dff21_l$x$data$text)
tooltip_text <- map(tooltip_text_raw, read_html)
tooltip_text <- map(tooltip_text, html_text) %>% unlist()

# split each tooltip on the colons that separate labels from values
tooltip_text_split <- strsplit(tooltip_text, ":", fixed = TRUE)

tooltip_text_split_almost_clean <-
  map(tooltip_text_split,
      ~ gsub("Report Date|Percent|Number of tests", "", .x))

tooltip_text_split_clean <-
  map(tooltip_text_split_almost_clean, ~ str_squish(.[. != ""]))

tests_df <-
  map_dfr(tooltip_text_split_clean,
          ~ data.frame(
            date = as.Date(.x[1]),
            percent = .x[2],
            tests = .x[3]
          ))

head(tests_df)
#>         date percent tests
#> 1 2020-03-06    9.68    31
#> 2 2020-03-07    0.00   142
#> 3 2020-03-08    0.00   213
#> 4 2020-03-09    2.51   239
#> 5 2020-03-10    3.90   282
#> 6 2020-03-11    1.05   572
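
Note that percent and tests are stored as character columns in the data frame above. As a possible alternative, the fields can be captured with one regular expression and converted to typed columns. The sketch below uses base R only and assumes a tooltip layout of the form "Report Date: ... Percent: ... Number of tests: ..."; the two sample strings are hypothetical and only modelled on the labels used above.

```r
# Hypothetical tooltip strings (assumed format, not copied from the page).
tooltips <- c(
  "Report Date: 2020-03-06<br>Percent: 9.68<br>Number of tests: 31",
  "Report Date: 2020-03-07<br>Percent: 0.00<br>Number of tests: 142"
)

# One capture group per field; lazy wildcards skip whatever separates them.
pattern <- paste0(
  "Report Date:\\s*(\\d{4}-\\d{2}-\\d{2}).*?",
  "Percent:\\s*([0-9.]+).*?",
  "Number of tests:\\s*([0-9]+)"
)

matches <- regmatches(tooltips, regexec(pattern, tooltips, perl = TRUE))

# Each match is c(full_match, date, percent, tests); stack them into a matrix.
fields <- do.call(rbind, matches)

tests_typed <- data.frame(
  date    = as.Date(fields[, 2]),
  percent = as.numeric(fields[, 3]),
  tests   = as.integer(fields[, 4])
)

str(tests_typed)
```

Strings that do not match the pattern yield an empty match and would be dropped silently by rbind(), so it is worth comparing nrow(tests_typed) with length(tooltips) before relying on the result.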

html

javascript

r

web-scraping

rvest