Parsing a JSON document that contains HTML in R

Question

If I query my target link as follows:

library(jsonlite)
link <- "https://www.forest-trends.org/wp-content/themes/foresttrends/map_tools/project_fetch_single.php?pid=1"
df <- fromJSON(link)

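I get a JSON list with a single element: df$html. I want to parse this HTML with rvest so that I can access tags such as psize and pstatus. But the double backslash \\ seems to be stopping me. Any idea how to formulate my rvest query correctly? I was thinking of something like: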

df$html %>% html_node(xpath = '//div[contains(@class, \"psize\")]') %>% html_text()

Answer 1

By combining a few different functions, you can get there. This is not a 100% correct answer, but it may give you some ideas on how to format the string.

library(rvest)
library(tidyr)

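# Read the raw response as HTML, take the first div under body,
# extract its text and split it on newlines and tabs
split <- read_html(link) %>% 
  html_node(xpath='/html/body/div') %>% 
  html_text() %>% 
  strsplit(., split = "\\n|\\t")

# Drop NA and empty pieces
split <- split[[1]][!is.na(split[[1]]) & split[[1]] != ""]
# Keep the first five "label: value" entries and split each into two columns
data.frame(col1 = split[1:5]) %>% 
  separate(col = col1, into = c("col1", "col2"), sep = ": ", extra = "drop")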

          col1                                                             col2
1          Size                                                         85000 ha
2        Status                                                   In development
3   Description                              REDD project in Madre de Dios, Peru
4     Objective Carbon sequestration or avoided, Carbon sequestration or avoided
5 Interventions                                   Afforestation or reforestation
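
A more direct route may be to parse the decoded string itself. Once fromJSON has decoded the response, df$html should be an ordinary character string; the backslashes only appear when R prints it, so read_html can take it as literal HTML. The sketch below is only an illustration: json_txt is a hypothetical stand-in for the server response, and the assumption that psize and pstatus are div classes comes from the XPath in the question, not from the real payload.

```r
library(jsonlite)
library(rvest)

# Hypothetical stand-in for the server response (assumed structure;
# the real payload may be laid out differently).
json_txt <- '{"html":"<div class=\\"pinfo\\">\\n\\t<div class=\\"psize\\">Size: 85000 ha</div>\\n\\t<div class=\\"pstatus\\">Status: In development</div>\\n</div>"}'

# fromJSON decodes the JSON escapes, so df$html holds plain quotes,
# newlines and tabs rather than backslash sequences.
df <- fromJSON(json_txt)

# Parse the character string as literal HTML before querying it.
doc <- read_html(df$html)

# XPath query as in the question, using single quotes inside the expression.
size <- doc %>%
  html_node(xpath = "//div[contains(@class, 'psize')]") %>%
  html_text(trim = TRUE)

# The same kind of lookup with a CSS selector.
status <- doc %>%
  html_node("div.pstatus") %>%
  html_text(trim = TRUE)
```

With the real endpoint, passing link to fromJSON in place of json_txt should work the same way, provided the returned HTML really uses those class names.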
