How to parse addresses from a website by specifying a class in R?

Question

I want to parse the addresses of all stores on the following website: https://www.carrefour.fr/magasin/region/, looping through the regions. For example, starting with the region "auvergne-rhone-alpes-84", the full URL is https://www.carrefour.fr/magasin/region/auvergne-rhone-alpes-84. Note that I can add more regions later; for now I just want it to work with one region.

library(httr)   # GET()
library(rvest)  # read_html(), html_nodes(), html_text(), %>%

carrefour <- "https://www.carrefour.fr/magasin/region/"
addresses_vector = c()

for (current_region in c("auvergne-rhone-alpes-84")) {
  current_region_url = paste(carrefour, current_region, "/", sep="")
  
  x <- GET(url=current_region_url)
  
  html_doc <- read_html(x) %>%
    html_nodes("[class = 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2']")
  
  addresses_vector <- c(addresses_vector, html_doc %>%
                          rvest::html_nodes('body')%>%
                          xml2::xml_find_all(".//div[contains(@class, 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2')]") %>%
                          rvest::html_text())
}

I have also tried x %>% read_html() %>% rvest::html_nodes(xpath = "/html/body/main/div[1]/div/div[2]/div[2]/ol/li[1]/div/div[1]/div[2]/div[2]") %>% rvest::html_text() (copying the whole XPath manually), x %>% read_html() %>% html_nodes("div.ds-body-text.ds-store-card__details--content.ds-body-text--size-m.ds-body-text--color-standard-2") %>% html_text(), and several other approaches, but I always get back a character(0) result.

Any help is appreciated!

Answer 1

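You can write a couple of custom functions to help, then use purrr to map the function that fetches store data over the output of the first helper function.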

First, extract the region URLs, together with each region's name and region ID, and store them in a tibble. This is the first helper function, get_regions.
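As a minimal sketch of the ID extraction inside get_regions, the snippet below pulls the trailing number out of one region link. The link is the URL quoted in the question; whether every region link on the site ends in "-<digits>" with no trailing slash is an assumption.

```r
library(stringr)

# Region link taken from the question; other links are assumed to follow the same shape.
link <- "https://www.carrefour.fr/magasin/region/auvergne-rhone-alpes-84"

# In R source the backslash must be doubled: "\\d" reaches the regex engine as \d.
m <- str_match(link, "-(\\d+)$")

# Column 1 of the match matrix is the full match, column 2 is the first capture group.
region_id <- as.integer(m[, 2])
region_id
```

If a link does not match the pattern, str_match returns NA for that row, so the resulting region_id is NA rather than an error.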

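Then use another function, get_store_info, to extract the store information from those region URLs. The store data is held in an attribute of a div tag; the visible store cards are built from it dynamically when JavaScript runs in the browser, but that does not happen when using rvest, which is why selecting by the rendered class names returns character(0).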
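The difference between the rendered cards and the served HTML can be sketched offline. The HTML below is a hypothetical, heavily simplified stand-in for the real page: only the #store-locator id and the :context-stores attribute name come from the answer's code, while the JSON fields (name, address) are invented for illustration.

```r
library(rvest)
library(jsonlite)

# Hypothetical store payload; the real attribute's field names are not known here.
stores_json <- '[{"name":"Store A","address":"1 Rue Exemple"},{"name":"Store B","address":"2 Rue Exemple"}]'

# Simplified stand-in for the served page: no store cards, only a JSON attribute.
html <- paste0(
  "<html><body><div id=\"store-locator\" :context-stores='",
  stores_json,
  "'></div></body></html>"
)
page <- read_html(html)

# Selecting by the class of the rendered cards finds nothing, because those
# nodes would only exist after JavaScript has run.
rendered <- page %>%
  html_nodes("div.ds-store-card__details--content") %>%
  html_text()

# Reading the attribute and parsing it as JSON recovers the data instead.
stores <- page %>%
  html_node("#store-locator") %>%
  html_attr(":context-stores") %>%
  parse_json(simplifyVector = TRUE)
```

Here rendered is empty, while stores is a data frame with one row per store. On the live site the attribute may contain more fields or a different nesting, so inspect the parsed result before relying on specific columns.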

Apply the function that extracts store information over the list of region URLs and region IDs.

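If you use map2_dfr to pass both the region ID and the region link to the function that extracts store data, every returned store row carries its region ID, which you can then use to join the map2_dfr result to the region tibble generated earlier.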
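The map-then-join pattern can be sketched without touching the network. Everything below is toy data: fake_store_info is a hypothetical stand-in for get_store_info, and the links are placeholders.

```r
library(purrr)
library(dplyr)

# Toy region table with the same columns that get_regions returns.
region_df <- tibble(
  region    = c("Region A", "Region B"),
  link      = c("https://example.com/region-a-1", "https://example.com/region-b-2"),
  region_id = c(1L, 2L)
)

# Hypothetical stand-in for get_store_info: returns two fake stores per region
# and tags each row with the region ID it was called with.
fake_store_info <- function(region_url, r_id) {
  tibble(
    store     = paste0("store-", r_id, c("a", "b")),
    region_id = r_id
  )
}

# map2_dfr walks both vectors in parallel and row-binds the results.
store_df <- map2_dfr(region_df$link, region_df$region_id, fake_store_info)

# The shared region_id column links each store back to its region.
final_df <- inner_join(region_df, store_df, by = "region_id")
```

Passing the ID in alongside the link is what makes the join possible, since the store rows would otherwise have no key pointing back to their region.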

Then do some column cleaning, e.g. drop the columns you don't need.
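A possible cleanup step, shown on a one-row toy tibble. The column names name and address are assumptions for illustration, since the real columns depend on what the site's JSON contains.

```r
library(dplyr)

# Toy joined result; the store columns are hypothetical.
final_df <- tibble(
  region    = "Region A",
  link      = "https://example.com/region-a-1",
  region_id = 1L,
  name      = "Store A",
  address   = "1 Rue Exemple"
)

# Drop the link column and give the store name a clearer label.
cleaned <- final_df %>%
  select(-link) %>%
  rename(store_name = name)
```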

library(rvest)
library(purrr)
library(dplyr)
library(readr)
library(jsonlite)
library(stringr)  # str_match()

# Collect every region's name, absolute link and numeric ID from the store locator page.
get_regions <- function() {
  url <- "https://www.carrefour.fr/magasin"
  page <- read_html(url)
  regions <- page %>% html_nodes(".store-locator-footer-list__item > a")
  t <- tibble(
    region = regions %>% html_text(trim = TRUE),
    link = regions %>% html_attr("href") %>% url_absolute(url),
    region_id = NA_integer_
  ) %>% mutate(region_id = str_match(link, "-(\\d+)$")[, 2] %>%  # trailing digits of the link
    as.integer())
  return(t)
}

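# Read the store data for one region from the JSON held in the
# :context-stores attribute of the #store-locator element.
get_store_info <- function(region_url, r_id) {
  region_page <- read_html(region_url)
  store_data <- region_page %>%
    html_node("#store-locator") %>%
    html_attr(":context-stores") %>%
    parse_json(simplifyVector = TRUE) %>%
    as_tibble()
  store_data$region_id <- r_id  # key for joining back to the region tibble
  return(store_data)
}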

region_df <- get_regions()

store_df <- map2_dfr(region_df$link, region_df$region_id, get_store_info)

final_df <- inner_join(region_df, store_df, by = 'region_id') # now clean columns within this.

