How can I extract data using xpath and the rvest function html_nodes()?

Question

I'm using the R package rvest for web scraping. I don't get an error; instead, my code stores an empty character vector in the environment.

My code:

library(rvest)

amore_tomato_page <- "https://thrivemarket.com/p/amore-tomato-paste"
amore_tomato <- read_html(amore_tomato_page)
amore_tomato_body <- amore_tomato %>%
  html_node("body") %>%
  html_children()

allergens <- amore_tomato %>%
  html_nodes(xpath = '/html/body/div[1]/div[2]/div[4]/div[6]/div/div[1]/section/div/div/div/div/div[2]/div[2]/p[2]') %>%
  html_attr()

ingredients <- amore_tomato %>%
  html_nodes(xpath = '/html/body/div[1]/div[2]/div[4]/div[6]/div/div[1]/section/div/div/div/div/div[2]/div[1]/p') %>%
  html_attr()

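I'm trying to extract the allergen information and ingredients for this product (and for hundreds of other products).

Thanks in advance for any help with this!

Best,
~梅拉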

Answer 1

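The data is loaded dynamically from a script tag that contains a JSON string, which would explain why the XPath expressions match nothing in the HTML that read_html() retrieves. You can extract that string, deserialize it into an R object with jsonlite, and pull out the fields of interest: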

library(tidyverse)
library(rvest)
library(jsonlite)

amore_tomato_page <- "https://thrivemarket.com/p/amore-tomato-paste"
amore_tomato <- read_html(amore_tomato_page)

# The product data is embedded as JSON in the <script id="__NEXT_DATA__"> tag
data <- amore_tomato %>%
  html_element("#__NEXT_DATA__") %>%
  html_text() %>%
  jsonlite::parse_json(simplifyVector = TRUE)

# nutrition_info is a data frame; keep the value of the allergen warning row
allergy_info <- filter(data$props$pageProps$product$nutrition_info, friendly_label == 'Warning / Allergen Information')$value
ingredients <- data$props$pageProps$product$ingredients
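
Since the question mentions hundreds of products, the same steps could be wrapped in a small helper. The following is only a sketch that runs offline: `sample_html` is a made-up stand-in for a product page (its values are invented), `extract_product_info()` is a hypothetical helper name, and the JSON path is assumed to match the one used above, which may not hold for every page.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for a product page: the visible <div> is empty in the
# raw HTML, and the data sits in a JSON <script> tag. Values are invented.
sample_html <- '
<html><body>
  <div id="__next"></div>
  <script id="__NEXT_DATA__" type="application/json">
  {"props":{"pageProps":{"product":{
    "ingredients":"Tomato paste, salt",
    "nutrition_info":[
      {"friendly_label":"Serving Size","value":"2 tsp"},
      {"friendly_label":"Warning / Allergen Information","value":"None listed"}
    ]}}}}
  </script>
</body></html>'

page <- read_html(sample_html)

# An XPath aimed at rendered content finds nothing in the static HTML,
# so html_text() returns an empty character vector.
rendered <- page %>%
  html_elements(xpath = '//div[@id="__next"]//p') %>%
  html_text()

# Hypothetical helper: takes a parsed page and returns a list with the
# allergen text and the ingredients, or NA when a field is missing.
extract_product_info <- function(page) {
  raw_json <- page %>%
    html_element("#__NEXT_DATA__") %>%
    html_text()
  if (is.na(raw_json)) {
    return(list(allergens = NA_character_, ingredients = NA_character_))
  }

  product <- parse_json(raw_json, simplifyVector = TRUE)$props$pageProps$product
  info <- product$nutrition_info
  allergens <- info$value[info$friendly_label == "Warning / Allergen Information"]

  list(
    allergens = if (length(allergens) > 0) allergens[1] else NA_character_,
    ingredients = if (is.null(product$ingredients)) NA_character_ else product$ingredients
  )
}

result <- extract_product_info(page)
```

For real pages, something like `lapply(urls, function(u) extract_product_info(read_html(u)))` would apply the helper to a vector of product URLs (`urls` is assumed here); adding a pause such as `Sys.sleep()` between requests is a reasonable courtesy to the site.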
