R: web-scraping returns wrong prices for products
Can someone take a look at why my code returns the wrong prices for products?
For example, let's look at this TV:
Hisense LED Ultra HD 4K 55" Smart TV 55A6GSV
-Precio antes (NORMAL) on the webpage: S/ 2,299
-Precio antes (NORMAL) in my results: S/ 2,999
-Precio actual (INTERNET) on the webpage: S/ 1,699
-Precio actual (INTERNET) in my results: S/ 2,199
-Precio tarjeta on the webpage: N/A
-Precio tarjeta in my results: S/ 1,999
Code:
library(rvest)
library(purrr)
library(tidyverse)
urls <- list("https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=1",
"https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=2")
h <- urls %>% map(read_html) # scrape once, parse as necessary
m <- h %>% map_df(~{
r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text
tibble(
periodo = lubridate::year(Sys.Date()),
fecha = Sys.Date(),
ecommerce = "ripley",
producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
precio.antes = ifelse(length(r.precio.antes) == 0, NA, r.precio.antes),
precio.actual = ifelse(length(r.precio.actual) == 0, NA, r.precio.actual),
precio.tarjeta = ifelse(length(r.precio.tarjeta) == 0, NA, r.precio.tarjeta)
)})
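The issue appears to be in ifelse, which requires all of its arguments to have the same length. Here, the 'no' argument has a length greater than 1 while the test has length 1, so only the first element of 'no' is returned. It is better to use if/else and return the result as a list, since a data.frame/tibble requires its columns to have the same length.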
m <- h %>% map(~{
r.precio.antes <- html_nodes(.x, '.catalog-prices__list-price') %>% html_text
r.precio.actual <- html_nodes(.x, '.catalog-prices__offer-price') %>% html_text
r.precio.tarjeta <- html_nodes(.x, '.catalog-prices__card-price') %>% html_text
r.precio.antes <- if(length(r.precio.antes) == 0) NA else r.precio.antes
r.precio.actual <- if(length(r.precio.actual) == 0) NA else r.precio.actual
r.precio.tarjeta <- if(length(r.precio.tarjeta) == 0) NA else r.precio.tarjeta
list(
periodo = lubridate::year(Sys.Date()),
fecha = Sys.Date(),
ecommerce = "ripley",
producto = html_nodes(.x, ".catalog-product-details__name") %>% html_text,
precio.antes =r.precio.antes, precio.actual = r.precio.actual, precio.tarjeta = r.precio.tarjeta)
})
-Check the length of each element of the nested list
map(m, lengths)
[[1]]
periodo fecha ecommerce producto precio.antes precio.actual precio.tarjeta
1 1 1 48 44 48 18
[[2]]
periodo fecha ecommerce producto precio.antes precio.actual precio.tarjeta
1 1 1 46 45 46 2
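To make the ifelse behaviour described above concrete, here is a small offline sketch. The price vector and product names are made up for illustration and are not scraped data. It shows that a length-1 test makes ifelse return only the first price, which data.frame/tibble then recycles to every row. That would explain why every product in the question ended up with the prices of the first TV on the page.

```r
# Made-up price vector standing in for the scraped text (not real data)
r.precio <- c("S/ 2,999", "S/ 4,099", "S/ 3,199")

# ifelse() shapes its result like `test`; the test here has length 1,
# so only the first element of the `no` argument comes back
bad <- ifelse(length(r.precio) == 0, NA, r.precio)
bad
length(bad)

# if/else is not vectorised and returns the whole vector unchanged
good <- if (length(r.precio) == 0) NA else r.precio
length(good)

# The length-1 value is then recycled to every row of the table
df <- data.frame(
  producto = c("tv A", "tv B", "tv C"),  # hypothetical product names
  precio   = bad,
  stringsAsFactors = FALSE
)
df$precio
```

Every row of df carries the first price, which is the symptom reported in the question.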
One option could be
library(dplyr)
library(purrr)
library(tidyr)
library(data.table)
out <- h %>%
map_dfr(~ html_nodes(.x, ".catalog-product-details__name, .catalog-prices__list-price, .catalog-prices__offer-price, .catalog-prices__card-price") %>%
{tibble(col1 = html_attr(., "title"), col2 = html_text(.)) %>%
mutate(col1 = case_when(is.na(col1) ~ "product", TRUE ~ col1)) %>%
mutate(grp = cumsum(col1 == "product")) %>%
pivot_wider(names_from = col1, values_from = col2) %>%
select(-grp) })
-Output
> out
# A tibble: 94 x 4
product `Precio Normal` `Precio Internet` `Precio Ripley`
<chr> <chr> <chr> <chr>
1 "TELEVISOR LG LED ULTRA HD 4K 50\" SMART TV THINQ AI 50UP7750PSB (2021)" S/ 2,999 S/ 2,199 "S/ 1,999 "
2 "TELEVISOR SAMSUNG LED CRYSTAL ULTRA HD 4K SMART TV 65\" UN65AU7000GXPE" S/ 4,099 S/ 2,699 "S/ 2,499 "
3 "TELEVISOR SAMSUNG CRYSTAL ULTRA HD 4K 58'' SMART TV UN58AU7000GXPE" S/ 3,199 S/ 2,399 "S/ 2,299 "
4 "TELEVISOR LG OLED ULTRA HD 4K 48\" SMART TV THINQ AI OLED48A1PSA (2021)" S/ 4,799 S/ 3,699 "S/ 3,499 "
5 "TELEVISOR SAMSUNG QLED LIFESTYLE THE FRAME 55\" LS03A QLED 4K" S/ 4,899 S/ 3,999 <NA>
6 "TELEVISOR TCL QLED ULTRA HD 4K 65\" SMART TV 65C715" S/ 3,499 S/ 3,199 "S/ 2,999 "
7 "TELEVISOR LG LED ULTRA HD 4K 43\" SMART TV THINQ AI 43UP7700PSB (2021)" S/ 2,299 S/ 1,899 "S/ 1,799 "
8 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV" S/ 2,299 S/ 1,699 <NA>
9 "TELEVISOR AOC LED ULTRA HD 4K 50\" SMART TV LE50U6305" S/ 2,299 S/ 1,749 "S/ 1,649 "
10 "TELEVISOR LG LED ULTRA HD 4K 60\" SMART TV THINQ AI 60UP7750PSB (2021)" S/ 3,899 S/ 3,199 "S/ 3,099 "
# … with 84 more rows
-Checking the product from the OP's comment
> out %>%
filter(product == "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)")
# A tibble: 1 x 4
product `Precio Normal` `Precio Internet` `Precio Ripley`
<chr> <chr> <chr> <chr>
1 "TELEVISOR LG NANOCELL ULTRA HD 4K 65\" SMART TV 65NANO96SNA (2020)" S/ 24,999 S/ 8,999 <NA>
This is the same as on the webpage.
Or the second product shown in the OP's post
> out %>%
filter(str_detect(product, "55A6GSV"))
# A tibble: 1 x 4
product `Precio Normal` `Precio Internet` `Precio Ripley`
<chr> <chr> <chr> <chr>
1 "TELEVISOR HISENSE LED ULTRA HD 4K 55\" SMART TV 55A6GSV" S/ 2,299 S/ 1,699 <NA>
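The grouping step in the option above (cumsum over the product rows, then pivot_wider) may be easier to follow on a small hand-made table. The sketch below skips the scraping entirely and assumes the nodes arrive in page order with each product name before its prices. The values in nodes are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hand-made stand-in for the scraped nodes (hypothetical values):
# a product-name node has no title attribute (NA), price nodes carry one
nodes <- tibble(
  col1 = c(NA, "Precio Normal", "Precio Internet", "Precio Ripley",
           NA, "Precio Normal", "Precio Internet"),
  col2 = c("TV A", "S/ 2,999", "S/ 2,199", "S/ 1,999",
           "TV B", "S/ 2,299", "S/ 1,699")
)

wide <- nodes %>%
  # label the rows that hold a product name
  mutate(col1 = case_when(is.na(col1) ~ "product", TRUE ~ col1)) %>%
  # every product row starts a new group: 1, 1, 1, 1, 2, 2, 2
  mutate(grp = cumsum(col1 == "product")) %>%
  # one row per group; a price missing from a group is filled with NA
  pivot_wider(names_from = col1, values_from = col2) %>%
  select(-grp)

wide
```

Because "TV B" has no "Precio Ripley" row, that cell comes out as NA instead of borrowing a price from another product.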
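If you first select the list of container nodes, one per TV listing, and then apply your CSS selectors to each node in that list inside a map_dfr call, you can take advantage of the fact that NA is returned automatically wherever the child node is not present: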
library(rvest)
library(purrr)
library(tidyverse)
urls <- list(
"https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=1",
"https://simple.ripley.com.pe/tecnologia/tv-y-cine-en-casa/televisores?page=2"
)
h <- urls |> map(read_html) # scrape once, parse as necessary
df <- map_dfr(h |>
map(~ .x |>
html_nodes("div.catalog-product-item__container")), ~
data.frame(
periodo = lubridate::year(Sys.Date()),
fecha = Sys.Date(),
ecommerce = "ripley",
producto = .x |> html_node(".catalog-product-details__name") |> html_text(),
precio.antes = .x |> html_node('[title="Precio Normal"]') |> html_text(),
precio.actual = .x |> html_node('[title="Precio Internet"]') |> html_text(),
precio.tarjeta = .x |> html_node('[title="Precio Ripley"]') |> html_text()
))
For older R versions (before 4.1, where the native pipe was introduced), replace |> with %>%.
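The difference between html_nodes on the whole page and html_node on each container can be checked without hitting the site. The sketch below parses a tiny hand-written HTML string. The markup, the "item" and "name" classes and the prices are invented for illustration and are not Ripley's real page structure.

```r
library(rvest)

# Hypothetical two-item page; only the title attributes mimic the real site
page <- read_html('
<html><body>
  <div class="item"><span class="name">TV A</span>
    <span title="Precio Normal">S/ 2,999</span>
    <span title="Precio Ripley">S/ 1,999</span></div>
  <div class="item"><span class="name">TV B</span>
    <span title="Precio Normal">S/ 2,299</span></div>
</body></html>')

items <- html_nodes(page, "div.item")

# html_nodes() on the whole page silently skips items without the child,
# so the result no longer lines up with the product names
n_page_level <- length(html_nodes(page, '[title="Precio Ripley"]'))
n_page_level

# html_node() on the set of containers yields exactly one result per item,
# with NA where the child node is absent
tarjeta <- html_text(html_node(items, '[title="Precio Ripley"]'))
tarjeta
```

The page-level query finds a single node for two items, while the per-container query keeps one entry per item and so stays aligned with the product names.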