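Using rvest to scrape specific values from a web page
I am using R and RStudio.
I am trying to use the rvest package to scrape data from a specific web page. Below is a partial screenshot of the page, with the values I am interested in scraping circled in red.
I am completely new to HTML and its elements, and I am having trouble working out how to use the relevant HTML tags in rvest. Using Chrome DevTools, I have been able to find where each item I need sits in the HTML code.
Here are the tags associated with each item: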
Table Headers:
<thead style="width: 547px; top: 0px; z-index: auto;" class="">
<tr class="hprt-table-header">
<th class="hprt-table-header-cell -first" style="width: 134px;">
Accommodation Type
</th>
<th class="hprt-table-header-cell hprt-table-header-price" style="width: 89px;">
Today's Price
</th>
<th class="hprt-table-header-cell hprt-table-header-policies" style="width: 146px;">
Your Choices</th>
Standard Queen Room:
<a class="hprt-roomtype-link" href="#RD27576901" data-room-id="27576901" id="room_type_id_27576901" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Standard Queen Room
</span>
MUR 13,097:
<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR 13,097
</div>
All-Inclusive:
" id="b_tt_holder_5" aria-describedby="materialized_tooltip_1n6pi">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>
Superior Queen Room:
<a class="hprt-roomtype-link" href="#RD27576902" data-room-id="27576902" id="room_type_id_27576902" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Superior Queen Room
</span>
MUR 14,266:
<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR 14,266
</div>
All-Inclusive:
" id="b_tt_holder_9" aria-describedby="materialized_tooltip_n2p5s">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>
I would like to turn the output into a data frame like this:
Accommodation Type     Today's Price   Your Choices
Standard Queen Room    MUR 13,097      All-Inclusive
Superior Queen Room    MUR 14,266      All-Inclusive
My R code currently stands as follows:
if (!require(rvest)) install.packages('rvest')
library(rvest)
url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")
Any help would be greatly appreciated.
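This is not a complete solution, since this is a fairly complex task.
In general, you can select HTML tags/nodes with html_nodes() by specifying their class or id attribute. In your case I don't see any ids, but there are classes. A class is selected with a . prefix, while an id would be prefixed with #, e.g. ".hprt-table-header" (used in the code below). The code for extracting the text is very similar for each block of information you are after; just modify the code below. One part that may be a bit harder is working out which rows have multiple "price" and "choices" values.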
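As a minimal sketch of the two prefixes, the snippet below runs a class selector and an id selector against a tiny hand-written HTML fragment instead of the live page. The fragment is modeled on the tags quoted in the question (the id `room_type_id_27576901` is taken from there), but it is an assumption that the real page still looks like this.

```r
library(rvest)

# Hand-written fragment modeled on the question's tags; not the live page
page <- read_html('<html><body>
  <a class="hprt-roomtype-link" id="room_type_id_27576901">
    <span class="hprt-roomtype-icon-link ">Standard Queen Room</span>
  </a>
  <div class="bui-price-display__value">MUR 13,097</div>
</body></html>')

# "." selects nodes by class
by_class <- page %>%
  html_nodes(".hprt-roomtype-icon-link") %>%
  html_text(trim = TRUE)

# "#" selects the node with the given id
by_id <- page %>%
  html_nodes("#room_type_id_27576901") %>%
  html_text(trim = TRUE)

print(by_class)
print(by_id)
```

Both selectors end up at the same text here, because the `<span>` with the class sits inside the `<a>` that carries the id.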
library(rvest)
#> Loading required package: xml2
url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")
Table Headers
url1 %>%
html_nodes(".hprt-table-header") %>%
html_text() %>%
strsplit("\n") %>%
unlist() %>%
.[. != ""] %>%
gsub("\n", "", .) %>%
.[-5]
#> [1] "Accommodation Type" "Sleeps" "Today's price"
#> [4] "Your choices" "Quantity"
Room types
url1 %>%
html_nodes(".hprt-roomtype-icon-link") %>%
html_text() %>%
strsplit("\n") %>%
unlist() %>%
.[. != ""]
#> [1] "Standard Queen Room" "Superior Queen Room" "Deluxe Family Room"
#> [4] "Triple Room"
Prices
url1 %>%
html_nodes(".bui-price-display__value") %>%
html_text() %>%
strsplit("\n") %>%
unlist() %>%
.[. != ""] %>%
gsub("\n", "", .)
#> [1] "US5" "US1" "US4" "US0" "US2" "US7" "US8" "US3"
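The harder part mentioned above, rows with several price and choice values per room type, can be approached by looping over table rows rather than over the whole page. The sketch below is one possible way to do that, under assumptions: the HTML is a hand-written stand-in for the real table, the second row and its price of MUR 11,500 are invented for illustration, and the class names are the ones quoted in the question. html_node() returns one result per row, with NA where a row has no match, so the three vectors stay aligned.

```r
library(rvest)

# Toy table: the second row has no room-type cell, as happens when one
# room type spans several price rows. The MUR 11,500 row is made up.
html <- '<html><body><table id="hprt-table"><tbody>
<tr>
  <td><span class="hprt-roomtype-icon-link">Standard Queen Room</span></td>
  <td><div class="bui-price-display__value">MUR 13,097</div></td>
  <td><span class="ungreen_keep_green">All-Inclusive</span></td>
</tr>
<tr>
  <td><div class="bui-price-display__value">MUR 11,500</div></td>
  <td><span class="ungreen_keep_green">All-Inclusive</span></td>
</tr>
<tr>
  <td><span class="hprt-roomtype-icon-link">Superior Queen Room</span></td>
  <td><div class="bui-price-display__value">MUR 14,266</div></td>
  <td><span class="ungreen_keep_green">All-Inclusive</span></td>
</tr>
</tbody></table></body></html>'

rows <- read_html(html) %>% html_nodes("#hprt-table tbody tr")

# One value per row; NA where the row has no matching node
room   <- rows %>% html_node(".hprt-roomtype-icon-link") %>% html_text(trim = TRUE)
price  <- rows %>% html_node(".bui-price-display__value") %>% html_text(trim = TRUE)
choice <- rows %>% html_node(".ungreen_keep_green") %>% html_text(trim = TRUE)

# Carry the last seen room type down into rows that lack one
room_filled <- room[!is.na(room)][cumsum(!is.na(room))]

rooms_df <- data.frame(
  "Accommodation Type" = room_filled,
  "Today's Price"      = price,
  "Your Choices"       = choice,
  # numeric copy of the price with currency code and commas removed
  "Price (numeric)"    = as.numeric(gsub("[^0-9.]", "", price)),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

print(rooms_df)
```

The fill-down step assumes the first row always carries a room type; if it does not, the indexing would need a guard.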
Note that before scraping large amounts of data from a website, you should make sure you are not putting yourself at legal risk.
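Here is a solution that retrieves the table of prices and then does some data cleaning.
Some additional cleanup is still needed, but most of it is done.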
library(rvest)
library(dplyr)
library(stringr)
url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")
output <- url1 %>%
html_nodes(xpath = './/table[@id="hprt-table"]') %>%
html_table() %>% .[[1]]
# Fix the name of the fifth column
colnames(output)[5] <- "Quantity"
# Clean up columns:
# keep only the first line of these two columns to drop the repeated information
output2 <- output %>% mutate(across(all_of(c("Accommodation Type", "Today's price")), ~ str_extract(.x, ".*\n")))
# Collapse repeated whitespace and newlines
answer <- output2 %>% mutate(across(everything(), str_squish))
answer
# A tibble: 8 x 5
`Accommodation Ty… Sleeps `Today's price` `Your choices` Quantity
<chr> <chr> <chr> <chr> <chr>
1 Triple Room Max persons: 3 US8 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US8) 2 (US5) 3 (US,193) 4 (US$…
2 Triple Room Max persons: 1 … US3 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US3) 2 (US6) 3 (US9) 4 (US,…
3 Standard Queen Ro… Max persons: 2 US5 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US5) 2 (US0) 3 (US6) 4 (US,…
4 Standard Queen Ro… Max persons: 1 … US1 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US1) 2 (US1) 3 (US2) 4 (US…
5 Superior Queen Ro… Max persons: 2 US4 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US4) 2 (US8) 3 (US,063) 4 (US$…
6 Superior Queen Ro… Max persons: 1 … US0 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US0) 2 (US9) 3 (US9) 4 (US,…
7 Deluxe Family Room Max persons: 2 US2 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US2) 2 (US,064) 3 (US,596) 4 (U…
8 Deluxe Family Room Max persons: 1 … US7 All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US7) 2 (US5) 3 (US,342) 4 (US$…
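To get from a table like this to the three columns asked for in the question, one option is to keep only the first line of each cell and then select the columns of interest. The sketch below shows that idea on a small hand-made tibble; the cell contents after the first line ("1 queen bed", "Includes taxes", "FREE cancellation") are invented stand-ins for the extra text html_table() pulls in, not values taken from the page.

```r
library(dplyr)
library(stringr)

# Hand-made stand-in for the scraped table: each cell holds the value of
# interest on its first line, followed by extra text (invented here)
raw <- tibble(
  `Accommodation Type` = c("Standard Queen Room\n\n  1 queen bed\n",
                           "Superior Queen Room\n\n  1 queen bed\n"),
  `Today's price`      = c("MUR 13,097\nIncludes taxes",
                           "MUR 14,266\nIncludes taxes"),
  `Your choices`       = c("All-Inclusive\n\n  FREE cancellation",
                           "All-Inclusive\n\n  FREE cancellation"),
  Quantity             = c("Select rooms 0 1 2", "Select rooms 0 1 2")
)

tidy <- raw %>%
  # keep just the three columns the question asks for
  select(`Accommodation Type`, `Today's price`, `Your choices`) %>%
  # take everything up to the first newline, then normalize whitespace
  mutate(across(everything(), ~ str_squish(str_extract(.x, "^[^\n]*"))))

print(tidy)
```

Taking the first line is a heuristic; it only works when the value of interest comes first in the cell, which should be checked against the real table.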