How to scrape data from this specific webpage and save the output in a data frame?
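I am new to web scraping with R and need help completing this task. I am trying to scrape data from this specific webpage, but I am stuck at one particular point.
Here is the URL: webpage
Basically, I am trying to capture 3 elements from the webpage:
(1) Room type (CSS selector: .room h3)
(2) Meal plan (CSS selector: .meal-plan-title)
(3) Price (CSS selector: .price)
I have been able to extract these values from the webpage, but I am struggling to match them up the way they are displayed on the page.
My R code looks like this: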
library(rvest)
library(dplyr)
library(stringr)
library(tables)
MealPlan <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
#html_nodes(".meal-plan-text") %>%
html_nodes(".meal-plan-title") %>%
html_text()
MealPlan
Price <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
html_nodes(".price") %>%
html_text()
Price
RoomType <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=") %>%
html_nodes(".room h3") %>%
html_text()
RoomType
I would like the output in a data frame like this:
RoomType          MealPlan          Price
Chambre Standard  Petit Dej.+Diner  584 € / pers
Chambre Standard  All inclusive     864 € / pers
Chambre Confort   Petit Dej.+Diner  715 € / pers
Chambre Confort   All inclusive     995 € / pers
Bungalow          Petit Dej.+Diner  781 € / pers
Bungalow          All inclusive     1061 € / pers
Chambre Deluxe    Petit Dej.+Diner  847 € / pers
Chambre Deluxe    All inclusive     1127 € / pers
Any help is greatly appreciated.
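You could use map_dfr from purrr to produce a wide data frame with a separate column for each meal plan, then pivot_longer to gather those columns into a single one, with the price information as the values. The initial list you pass to map_dfr would be the parent elements representing each room listing, selected with the CSS selector .room.
All the rooms at the URL provided have the same combination of price entries, i.e. Petit déj. + diner and All inclusive. To cater for anything on other pages, you would either need to determine all the cases up front, or first gather every .room element from all pages into a list and then use something like read.dcf to map out all possible cases, entering NA wherever a plan is missing from a given listing. You would need to make sure a ":" is inserted between each key and value, as the Debian control format requires.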
library(rvest)
library(purrr)
library(dplyr)
library(tidyr)
page <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea%20beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En%20couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=")
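# Build one wide row per .room element: the room name plus one price column per meal plan
df <- map_dfr(page |> html_elements(".room"), ~
  data.frame(
    RoomType = .x |> html_element("h3") |> html_text(),
    `Petit Dej.+Diner` = .x |> html_element(".price") |> html_text() |> trimws(),
    `All inclusive` = .x |> html_element("div:nth-child(5) .price") |> html_text() |> trimws(),
    # Keep the meal plan names as written; by default data.frame() converts them to syntactic names
    check.names = FALSE
  )) |>
  # Stack the meal plan columns into MealPlan / Price pairs
  pivot_longer(!RoomType, names_to = "MealPlan", values_to = "Price")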
For older R versions (before 4.1, which do not have the native |> pipe):
library(rvest)
library(purrr)
library(dplyr)
library(tidyr)
page <- read_html("https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea%20beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En%20couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=")
# Same approach as above, written with the magrittr pipe
df <- map_dfr(page %>% html_elements(".room"), ~
  data.frame(
    RoomType = .x %>% html_element("h3") %>% html_text(),
    `Petit Dej.+Diner` = .x %>% html_element(".price") %>% html_text() %>% trimws(),
    `All inclusive` = .x %>% html_element("div:nth-child(5) .price") %>% html_text() %>% trimws(),
    # Keep the meal plan names as written; by default data.frame() converts them to syntactic names
    check.names = FALSE
  )) %>%
  pivot_longer(!RoomType, names_to = "MealPlan", values_to = "Price")
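The wide-then-long idea can also be tried offline. The following is only a sketch: the HTML is made up for illustration (the class names mirror the selectors above, but the nesting and the "plan" wrapper are assumptions, not the real page), and the prices are picked by position rather than with the nth-child selector.

```r
library(rvest)
library(purrr)
library(tidyr)

# Made-up markup standing in for the real page (structure is an assumption)
page <- minimal_html('
  <div class="room">
    <h3>Chambre Standard</h3>
    <div class="plan"><span class="price"> 584 € / pers </span></div>
    <div class="plan"><span class="price"> 864 € / pers </span></div>
  </div>
  <div class="room">
    <h3>Bungalow</h3>
    <div class="plan"><span class="price"> 781 € / pers </span></div>
    <div class="plan"><span class="price"> 1061 € / pers </span></div>
  </div>')

rooms <- page %>% html_elements(".room")

# One wide row per room: the first price is taken as half board, the second as all inclusive
wide <- map_dfr(rooms, function(room) {
  prices <- room %>% html_elements(".price") %>% html_text(trim = TRUE)
  data.frame(
    RoomType = room %>% html_element("h3") %>% html_text(trim = TRUE),
    `Petit Dej.+Diner` = prices[1],
    `All inclusive` = prices[2],
    check.names = FALSE
  )
})

# Stack the two meal plan columns into MealPlan / Price pairs
long <- pivot_longer(wide, !RoomType, names_to = "MealPlan", values_to = "Price")
print(long)
```

Scoping every lookup to a single .room element is what keeps each price attached to the right room, which is the matching problem described in the question.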
Example of handling differing price lists with read.dcf.
For read.dcf, I adopted the method used by @akrun in their answer, whereby read.dcf is used to map out all the meal plans that are present with a price, putting NA where a meal plan is not present for a given entry. For the XPath, I used an example given by @tomalak in their answer.
library(tidyverse)
library(rvest)
urls <- c(
"https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea%20beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En%20couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D=",
"https://www.hotelissima.fr/s/h/ile-maurice/bel-ombre/hotel-outrigger-mauritius.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=hotel+outrigger+mauritius&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D="
)
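# Collect every .room node from all pages into one flat list
entries <- purrr::map(urls, ~ read_html(.x) |> html_elements(".room")) |> unlist(recursive = FALSE)
meal_df <- map_dfr(entries, ~ {
  prices <- .x |>
    html_elements(".price") |>
    html_text(trim = TRUE)
  # For each price, climb to the nearest enclosing row and read its meal plan label
  meal_text <- .x |>
    html_elements(".price") |>
    html_elements(xpath = "./ancestor::div[contains(concat(' ', @class, ' '), 'row')][1]//h4[@class='meal-plan-text']") |>
    html_text(trim = TRUE)
  # Build "meal plan:price" lines so read.dcf() can parse them as key:value fields
  new <- paste(meal_text, prices, sep = ":")
  if (length(new) > 0) {
    as.data.frame(read.dcf(textConnection(new)))
  } else {
    # A room without any price contributes no row, so the rows would no longer line up with df below
    NULL
  }
})
df <- map_df(entries, ~
  data.frame(
    RoomType = .x |> html_element("h3") |> html_text()
  ))
# Pair each room name with its meal plan prices by position
listings <- cbind(df, meal_df)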
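To see how read.dcf ends up filling the gaps, here is a small offline sketch. The listings, the field names HalfBoard and AllInclusive, and the prices are all made up for illustration. Each character vector plays the role of the "meal plan:price" lines built for one room.

```r
library(dplyr)

# Hypothetical key:value lines, one character vector per room listing
listings <- list(
  c("HalfBoard:584", "AllInclusive:864"),
  c("AllInclusive:995") # this room offers no half board price
)

rows <- lapply(listings, function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  # read.dcf() returns a one-row character matrix with one column per field
  as.data.frame(read.dcf(con), stringsAsFactors = FALSE)
})

# bind_rows() aligns columns by name and fills the missing ones with NA
result <- bind_rows(rows)
print(result)
```

The NA values come from the row binding step (bind_rows here, map_dfr in the answer's code), not from read.dcf itself, which only turns each set of key:value lines into named columns.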
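A slower approach to the answer. I added the argument trim = TRUE to remove the extra white space.
One problem with MealPlan is that some of the entries have the class .noprice. One way to exclude them is to use XPath instead of CSS selectors in html_nodes. I don't know whether there is a way to do this with CSS selectors. What I did below was to extract both sets and then take their set difference.
For the prices, I used a regular expression to remove the extra space inside the price.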
library(rvest)
library(dplyr)
library(stringr)
url <- "https://www.hotelissima.fr/s/h/ile-maurice/mahebourg/astroea-beach.html?searchType=accomodation&searchId=4&guideId=&filters=&withFlights=false&airportCode=PAR&airport=Paris&search=astroea+beach&startdate=08%2F11%2F2021&stopdate=15%2F11%2F2021&duration=7&travelers=En+couple&travelType=&rooms%5B0%5D.nbAdults=2&rooms%5B0%5D.nbChilds=0&rooms%5B0%5D.birthdates%5B0%5D=&rooms%5B0%5D.birthdates%5B1%5D=&rooms%5B0%5D.birthdates%5B2%5D=&rooms%5B0%5D.birthdates%5B3%5D=&rooms%5B0%5D.birthdates%5B4%5D="
# Parse the page once and reuse it for every selector
page <- read_html(url)
Price <- page %>%
  html_nodes(".price") %>%
  html_text(trim = TRUE) %>%
  # Remove the space between digits; backslashes must be doubled inside R strings
  str_replace("(\\d)\\s(\\d)", "\\1\\2")
RoomType <- page %>%
  html_nodes(".room h3") %>%
  html_text(trim = TRUE)
AllMealPlans <- page %>%
  html_nodes(".meal-plan-text") %>%
  html_text(trim = TRUE)
MealPlansNoPrice <- page %>%
  html_nodes(".noprice .meal-plan-text") %>%
  html_text(trim = TRUE)
# setdiff() also removes duplicates, leaving the distinct meal plans that have a price
MealPlan <- setdiff(AllMealPlans, MealPlansNoPrice)
NumberMealPlans <- length(MealPlan)
NumberRoomTypes <- length(RoomType)
# Repeat the plans once per room and each room once per plan so the columns line up
MealPlanColumn <- MealPlan %>% rep(times = NumberRoomTypes)
RoomTypeColumn <- RoomType %>%
  rep(each = NumberMealPlans)
bind_cols(RoomType = RoomTypeColumn, MealPlan = MealPlanColumn, Price = Price)
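The two less obvious steps above, the regular expression and the rep() pairing, can be checked on their own without touching the website. This is a sketch with made-up input values. It assumes a price such as "1 061" is written with a space between the digits, which is what the regular expression is meant to remove.

```r
library(stringr)

# Hypothetical scraped prices; the second has a space inside the number
raw_prices <- c("584 € / pers", "1 061 € / pers")

# In an R string literal "\\d" reaches the regex engine as \d; a single "\d" is a syntax error
clean_prices <- str_replace(raw_prices, "(\\d)\\s(\\d)", "\\1\\2")

room_types <- c("Chambre Standard", "Bungalow")
meal_plans <- c("Petit Dej.+Diner", "All inclusive")

# each = repeats every room for all of its plans; times = cycles through the plans per room
grid <- data.frame(
  RoomType = rep(room_types, each = length(meal_plans)),
  MealPlan = rep(meal_plans, times = length(room_types))
)
print(clean_prices)
print(grid)
```

This pairing only lines up with the scraped prices if every room offers every meal plan in the same order, which holds for the page in the question but not necessarily for others.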