Using rvest to scrape specific values from a web page

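I am using R and RStudio. I am trying to scrape data from a specific web page using the rvest package. Below is a partial screenshot of the web page; the values I am interested in scraping are circled in red.

I am completely new to HTML and its elements, and I am having a hard time figuring out how to use the relevant HTML tags in rvest. Using Chrome DevTools, I have been able to find where each item I need sits in the HTML code.

The tags relevant to each item are given below: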

Table Headers:

<thead style="width: 547px; top: 0px; z-index: auto;" class="">
<tr class="hprt-table-header">
<th class="hprt-table-header-cell -first" style="width: 134px;">
Accommodation Type
</th>
<th class="hprt-table-header-cell hprt-table-header-price" style="width: 89px;">
Today's Price
</th>
<th class="hprt-table-header-cell hprt-table-header-policies" style="width: 146px;">
Your Choices</th>

Standard Queen Room:

<a class="hprt-roomtype-link" href="#RD27576901" data-room-id="27576901" id="room_type_id_27576901" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Standard Queen Room
</span>

MUR 13,097:

<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR&nbsp;13,097
</div>

All-Inclusive:

" id="b_tt_holder_5" aria-describedby="materialized_tooltip_1n6pi">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>

Superior Queen Room:

<a class="hprt-roomtype-link" href="#RD27576902" data-room-id="27576902" id="room_type_id_27576902" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Superior Queen Room
</span>

MUR 14,266:

<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR&nbsp;14,266
</div>

All-Inclusive:

" id="b_tt_holder_9" aria-describedby="materialized_tooltip_n2p5s">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>

I would like the output as a data frame like this:

 Accommodation Type      Today's Price  Your Choices
 Standard Queen Room      MUR 13,097    All-Inclusive
 Superior Queen Room      MUR 14,266    All-Inclusive

My R code currently stands as follows:

if (!require(rvest)) install.packages('rvest')

library(rvest)

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")     

Any help would be much appreciated.

This is not a complete solution, because this is a fairly complex task.

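In general: you can select HTML tags/nodes with html_nodes() by specifying their class or id. In your case I did not see ids, but there were classes. A class selector is prefixed with "." and an id selector with "#", e.g. ".hprt-table-header" (used in the code below). The code to extract the text is very similar for each piece of information you are after; just adapt the code below. One issue that could be a bit more difficult is handling rows that have multiple "price" and "choices" values.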
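To make the "." versus "#" distinction concrete, here is a minimal sketch that runs against a small inline HTML fragment rather than the live page. The fragment is an assumption: a trimmed-down imitation of the tags quoted in the question, so the real page may be structured differently.

```r
library(rvest)

# Assumed, simplified fragment modelled on the tags quoted in the question
page <- read_html('
<table>
  <thead><tr class="hprt-table-header">
    <th class="hprt-table-header-cell">Accommodation Type</th>
    <th class="hprt-table-header-cell">Today\'s Price</th>
  </tr></thead>
  <tbody><tr><td>
    <a class="hprt-roomtype-link" id="room_type_id_27576901">
      <span class="hprt-roomtype-icon-link">Standard Queen Room</span>
    </a>
  </td></tr></tbody>
</table>')

# Class selector: "." followed by the class name
headers <- page %>%
  html_nodes(".hprt-table-header-cell") %>%
  html_text(trim = TRUE)

# Id selector: "#" followed by the id
room <- page %>%
  html_nodes("#room_type_id_27576901") %>%
  html_text(trim = TRUE)

print(headers)
print(room)
```

Selecting the individual header cells and passing trim = TRUE to html_text() avoids the manual splitting on newlines, though whether that is enough on the real page is untested here.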

library(rvest)
#> Loading required package: xml2

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")     

Table Headers

url1 %>% 
  html_nodes(".hprt-table-header") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""] %>% 
  gsub("\n", "", .) %>% 
  .[-5]
#> [1] "Accommodation Type" "Sleeps"             "Today's price"     
#> [4] "Your choices"       "Quantity"

Room Types

url1 %>% 
  html_nodes(".hprt-roomtype-icon-link") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""]
#> [1] "Standard Queen Room" "Superior Queen Room" "Deluxe Family Room" 
#> [4] "Triple Room"

Prices

url1 %>% 
  html_nodes(".bui-price-display__value") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""] %>% 
  gsub("\n", "", .) 
#> [1] "US5" "US1" "US4" "US0" "US2" "US7" "US8" "US3"
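The three extractions above return vectors of different lengths, so they cannot simply be bound together column by column. One possible way around this, sketched here under assumptions, is to select the table rows first and then pull one value per row with html_node(), which returns exactly one result (or NA) for each input node. The markup below is hypothetical and simplified: one tr per offer, reusing the class names from the question. On the real page a room-type cell may span several offers, in which case the missing names would come back as NA and would still need filling in.

```r
library(rvest)

# Hypothetical, simplified markup: one <tr> per offer
page <- read_html('
<table>
  <tr>
    <td><span class="hprt-roomtype-icon-link">Standard Queen Room</span></td>
    <td><div class="bui-price-display__value">MUR&nbsp;13,097</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
  <tr>
    <td><span class="hprt-roomtype-icon-link">Superior Queen Room</span></td>
    <td><div class="bui-price-display__value">MUR&nbsp;14,266</div></td>
    <td><span class="ungreen_keep_green">All-Inclusive</span></td>
  </tr>
</table>')

rows <- html_nodes(page, "tr")

# cell_text() is a helper made up for this sketch: it takes the first match
# of a CSS selector inside each row and replaces the non-breaking space
# produced by &nbsp; with a regular space
cell_text <- function(rows, css) {
  txt <- rows %>% html_node(css) %>% html_text(trim = TRUE)
  gsub("\u00a0", " ", txt, fixed = TRUE)
}

result <- data.frame(
  accommodation_type = cell_text(rows, ".hprt-roomtype-icon-link"),
  todays_price       = cell_text(rows, ".bui-price-display__value"),
  your_choices       = cell_text(rows, ".ungreen_keep_green"),
  stringsAsFactors   = FALSE
)

print(result)
```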

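Please note that before scraping large amounts of data from a website, you should make sure you are not putting yourself in legal jeopardy.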

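Here is a solution that retrieves the price table and then performs some data cleaning:

Some additional cleanup is still needed, but most of the work is done.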

library(rvest)
library(dplyr)
library(stringr)

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&") 

output <- url1 %>% 
   html_nodes(xpath = './/table[@id="hprt-table"]')  %>% 
   html_table() %>% .[[1]]

    
# Fix the fifth column's name
colnames(output)[5] <- "Quantity"

# Clean up columns
# Keep only the first line of these two columns to drop the repeated information
output2 <- output %>% mutate_at(c("Accommodation Type", "Today's price"), ~ str_extract(., ".*\n"))
# Collapse repeated whitespace and newlines
answer <- output2 %>% mutate_all(str_squish)

answer
# A tibble: 8 x 5
  `Accommodation Ty… Sleeps           `Today's price` `Your choices`                                                                   Quantity                                                 
  <chr>              <chr>            <chr>           <chr>                                                                            <chr>                                                    
1 Triple Room        Max persons: 3   US8          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US8) 2 (US5) 3 (US,193) 4 (US$…
2 Triple Room        Max persons: 1 … US3          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US3) 2 (US6) 3 (US9) 4 (US,…
3 Standard Queen Ro… Max persons: 2   US5          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US5) 2 (US0) 3 (US6) 4 (US,…
4 Standard Queen Ro… Max persons: 1 … US1          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US1) 2 (US1) 3 (US2) 4 (US…
5 Superior Queen Ro… Max persons: 2   US4          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US4) 2 (US8) 3 (US,063) 4 (US$…
6 Superior Queen Ro… Max persons: 1 … US0          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US0) 2 (US9) 3 (US9) 4 (US,…
7 Deluxe Family Room Max persons: 2   US2          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US2) 2 (US,064) 3 (US,596) 4 (U…
8 Deluxe Family Room Max persons: 1 … US7          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US7) 2 (US5) 3 (US,342) 4 (US$…
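In dplyr 1.0.0 and later, mutate_at() and mutate_all() are superseded by across(). The sketch below shows the same two cleaning steps written that way. It runs on a small hand-built tibble instead of the scraped table; the cell contents, including the "(repeated details)" lines, are placeholders invented for illustration and are not values from the page.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the scraped table; contents are illustrative placeholders
output <- tibble(
  `Accommodation Type` = c("Standard Queen Room\n(repeated details)\n",
                           "Superior Queen Room\n(repeated details)\n"),
  Sleeps               = c("Max persons:   2", "Max persons:   2"),
  `Today's price`      = c("MUR 13,097\n(repeated details)",
                           "MUR 14,266\n(repeated details)")
)

answer <- output %>%
  # Keep only the first line of the two columns that repeat information
  mutate(across(c(`Accommodation Type`, `Today's price`),
                ~ str_extract(.x, ".*\n"))) %>%
  # Trim and collapse runs of whitespace, including newlines, in every column
  mutate(across(everything(), str_squish))

print(answer)
```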