R: web-scraping returns wrong prices for products

Question

Can someone take a look at why my code returns the wrong product prices?

For example, let's look at this TV:

HISENSE LED ULTRA HD 4K 55" SMART TV 55A6GSV

- Precio antes (NORMAL) on the web page: S/ 2,299
- Precio antes (NORMAL) in my results: S/ 2,999

- Precio actual (INTERNET) on the web page: S/ 1,699
- Precio actual (INTERNET) in my results: S/ 2,199

- Precio tarjeta on the web page: N/A
- Precio tarjeta in my results: S/ 1,999

Code:

library(rvest)
library(purrr)
library(tidyverse)

urls <- list("https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=1",
             "https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=2")



h <- urls %>% map(read_html)    # scrape once, parse as necessary

m <- h %>% map_df(~{
  r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
  r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
  r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text 
  
  
  tibble(
    periodo = lubridate::year(Sys.Date()),
    fecha = Sys.Date(),
    ecommerce = "ripley",
    producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
    precio.antes = ifelse(length(r.precio.antes) == 0, NA, r.precio.antes),
    precio.actual = ifelse(length(r.precio.actual) == 0, NA,  r.precio.actual),
    precio.tarjeta = ifelse(length(r.precio.tarjeta) == 0, NA,  r.precio.tarjeta)
  )})

Answer 1

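The issue seems to be with ifelse, which returns a result of the same length as its test argument. Here the test is a single TRUE/FALSE, so in the 'no' case, where the price vector has length greater than 1, only its first element is returned and then recycled across every row. It is better to use if/else and return a list, because a data.frame/tibble requires all columns to have the same length.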
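
To make the ifelse behaviour concrete, here is a minimal base R sketch. The prices and product names are made up for illustration and are not scraped from the site.

```r
# Made-up prices standing in for the scraped character vector
prices <- c("S/ 2,999", "S/ 4,099", "S/ 3,199")

# ifelse() is vectorised over the test; a length-1 test gives a length-1 result
res_ifelse <- ifelse(length(prices) == 0, NA, prices)

# if/else evaluates the test once and returns the whole vector
res_if <- if (length(prices) == 0) NA else prices

# The length-1 value is recycled, so every row repeats the first price
df <- data.frame(producto = c("TV A", "TV B", "TV C"), precio = res_ifelse)
print(df)
```

This would explain why every product in the question's result carries the prices of the first listing on the page. Applied to the scraping code, with if/else and a list as the return value: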

m <- h %>% map(~{
  r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
  r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
  r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text 
  
  r.precio.antes <- if(length(r.precio.antes) == 0) NA else r.precio.antes
  r.precio.actual <- if(length(r.precio.actual) == 0) NA else r.precio.actual
  r.precio.tarjeta <- if(length(r.precio.tarjeta) == 0) NA  else r.precio.tarjeta
 
 list(
      periodo = lubridate::year(Sys.Date()),
      fecha = Sys.Date(),
      ecommerce = "ripley",
      producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
      precio.antes =r.precio.antes, precio.actual = r.precio.actual, precio.tarjeta = r.precio.tarjeta)
  })

- Checking the length of each element of the nested list

map(m, lengths)
[[1]]
       periodo          fecha      ecommerce       producto   precio.antes  precio.actual precio.tarjeta 
             1              1              1             48             44             48             18 

[[2]]
       periodo          fecha      ecommerce       producto   precio.antes  precio.actual precio.tarjeta 
             1              1              1             46             45             46              2

An option could be

library(dplyr)
library(purrr)
library(tidyr)
library(data.table)
out <- h %>%
    # select the name and all three price nodes together, in document order
    map_dfr(~ html_nodes(.x, ".catalog-product-details__name, .catalog-prices__list-price, .catalog-prices__offer-price, .catalog-prices__card-price") %>%
    {tibble(col1 = html_attr(., "title"), col2 = html_text(.)) %>% 
      # name nodes have no title attribute, so label them "product"
      mutate(col1 = case_when(is.na(col1) ~ "product", TRUE ~ col1)) %>%
           # each product name starts a new group
           mutate(grp = cumsum(col1 == "product"))  %>%
     # one row per group; prices that are absent become NA
     pivot_wider(names_from = col1, values_from = col2) %>% 
        select(-grp) })

- Output

> out
# A tibble: 94 x 4
   product                                                                   `Precio Normal` `Precio Internet` `Precio Ripley`
   <chr>                                                                     <chr>           <chr>             <chr>          
 1 "TELEVISOR LG LED ULTRA HD 4K 50\" SMART TV THINQ AI 50UP7750PSB (2021)"  S/ 2,999        S/ 2,199          "S/ 1,999 "    
 2 "TELEVISOR SAMSUNG LED CRYSTAL ULTRA HD 4K SMART TV 65\" UN65AU7000GXPE"  S/ 4,099        S/ 2,699          "S/ 2,499 "    
 3 "TELEVISOR SAMSUNG CRYSTAL ULTRA HD 4K 58'' SMART TV UN58AU7000GXPE"      S/ 3,199        S/ 2,399          "S/ 2,299 "    
 4 "TELEVISOR LG OLED ULTRA HD 4K 48\" SMART TV THINQ AI OLED48A1PSA (2021)" S/ 4,799        S/ 3,699          "S/ 3,499 "    
 5 "TELEVISOR SAMSUNG QLED LIFESTYLE THE FRAME 55\" LS03A QLED 4K"           S/ 4,899        S/ 3,999           <NA>          
 6 "TELEVISOR TCL QLED ULTRA HD 4K 65\" SMART TV 65C715"                     S/ 3,499        S/ 3,199          "S/ 2,999 "    
 7 "TELEVISOR LG LED ULTRA HD 4K 43\" SMART TV THINQ AI 43UP7700PSB (2021)"  S/ 2,299        S/ 1,899          "S/ 1,799 "    
 8 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV"                 S/ 2,299        S/ 1,699           <NA>          
 9 "TELEVISOR AOC LED ULTRA HD 4K 50\" SMART TV LE50U6305"                   S/ 2,299        S/ 1,749          "S/ 1,649 "    
10 "TELEVISOR LG LED ULTRA HD 4K 60\" SMART TV THINQ AI 60UP7750PSB (2021)"  S/ 3,899        S/ 3,199          "S/ 3,099 "    
# … with 84 more rows

- Checking the product from the OP's comment

> out %>% 
   filter(product == "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)")
# A tibble: 1 x 4
  product                                                              `Precio Normal` `Precio Internet` `Precio Ripley`
  <chr>                                                                <chr>           <chr>             <chr>          
1 "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)" S/ 24,999       S/ 8,999          <NA>

This matches the web page.

Or the second product shown in the OP's post

> out %>% 
   filter(str_detect(product, "55A6GSV"))
# A tibble: 1 x 4
  product                                                   `Precio Normal` `Precio Internet` `Precio Ripley`
  <chr>                                                     <chr>           <chr>             <chr>          
1 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV" S/ 2,299        S/ 1,699          <NA>
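
The grouping step in this answer (cumsum on the "product" rows, then pivot_wider) can be tried on toy data. The sketch below uses a hypothetical flattened node list with two made-up products, the second without a card price. It swaps in if_else for the case_when call, which should be equivalent here.

```r
library(dplyr)
library(tidyr)

# Hypothetical flattened node list: NA in col1 marks a product-name node
nodes <- tibble(
  col1 = c(NA, "Precio Normal", "Precio Internet", "Precio Ripley",
           NA, "Precio Normal", "Precio Internet"),
  col2 = c("TV A", "S/ 2,999", "S/ 2,199", "S/ 1,999",
           "TV B", "S/ 2,299", "S/ 1,699")
)

toy <- nodes %>%
  mutate(col1 = if_else(is.na(col1), "product", col1),
         grp  = cumsum(col1 == "product")) %>%   # new group at each product name
  pivot_wider(names_from = col1, values_from = col2) %>%
  select(-grp)

print(toy)
```

Because every price row is assigned to the most recent product name, a missing price ends up as NA in that product's row instead of shifting the later values.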

Answer 2

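If you first select the list of container nodes for each TV listing, and then apply your CSS selectors to each node in that list inside map_dfr, you can take advantage of the fact that NA is returned automatically where a child node is not present. This works because html_node (singular) returns exactly one result per input node: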

library(rvest)
library(purrr)
library(tidyverse)

urls <- list(
  "https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=1",
  "https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=2"
)
h <- urls |> map(read_html) # scrape once, parse as necessary

df <- map_dfr(h |>
  map(~ .x |>
    html_nodes("div.catalog-product-item__container")), ~
data.frame(
  periodo = lubridate::year(Sys.Date()),
  fecha = Sys.Date(),
  ecommerce = "ripley",
  producto = .x |> html_node(".catalog-product-details__name") |> html_text(),
  precio.antes = .x |> html_node('[title="Precio Normal"]') |> html_text(),
  precio.actual = .x |> html_node('[title="Precio Internet"]') |> html_text(),
  precio.tarjeta = .x |> html_node('[title="Precio Ripley"]') |> html_text()
))

For R versions before 4.1.0, replace |> with %>%.
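
The difference between html_nodes and html_node can be checked on a small literal HTML fragment. The markup below is hypothetical and much simpler than Ripley's real pages; only the title attribute value is taken from the code above.

```r
library(rvest)

# Hypothetical markup: the second item has no card price
page <- read_html('
  <div class="item"><span class="name">TV A</span><span title="Precio Ripley">S/ 1,999</span></div>
  <div class="item"><span class="name">TV B</span></div>
')

items <- html_nodes(page, "div.item")

# html_nodes() on the whole page flattens the matches, so the gap disappears
flat <- html_text(html_nodes(page, '[title="Precio Ripley"]'))

# html_node() on the containers gives one result per item, NA where nothing matches
per_item <- html_text(html_node(items, '[title="Precio Ripley"]'))

print(length(flat))
print(per_item)
```

Selecting per container keeps each price vector aligned with the product names, which is what lets data.frame build the rows without any length checks.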
