How to extract this specific value from the cell of a table from this webpage?

Question

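I am trying to scrape the data in a table on this webpage (the URL is provided in the R code below).

My R code does extract the table, but I want to trim the information found in the column named "Your choices".

I do not want all of the data in the "Your choices" column. I only want to extract the text that comes before the first "\n".

Here is my R code: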

library(tidyverse)
library(rvest)  # provides read_html() and html_table(); not attached by library(tidyverse)

content <- read_html("https://www.booking.com/hotel/mu/tamarin.en-gb.html?aid=356980&label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ&sid=729aafddc363c28a2c2c7379d7685d87&all_sr_blocks=36363601_246990918_2_85_0&checkin=2021-11-15&checkout=2021-11-20&dest_id=-1354779&dest_type=city&dist=0&from_beach_key_ufi_sr=1&group_adults=2&group_children=0&hapos=1&highlighted_blocks=36363601_246990918_2_85_0&hp_group_set=0&hpos=1&no_rooms=1&sb_price_type=total&sr_order=popularity&sr_pri_blocks=36363601_246990918_2_85_0__29200&srepoch=1619681695&srpvid=51c8354f03be0097&type=total&ucfs=1&req_children=0&req_adults=2&hp_refreshed_with_new_dates=1")

tables <- content %>% html_table(fill = TRUE)

View(tables)


second_table <- tables[[2]]

View(second_table)

In RStudio, the data in the "Your choices" column looks like this (shown as an excerpt):

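1 FREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% basic discount available\n\n\n\n\n\n\nMeals:\nThere is no meal option with this room.\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don't show up, you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.

2 Good breakfast included\n\n\n\n\n\n\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% basic discount available\n\n\n\n\n\n\nMeals:\nContinental breakfast included\nBreakfast rated 7.6 - based on 38 reviews.\n\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don't show up, you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.

3 Breakfast & dinner included\n\n\n\n\n\n\n\n\n\nFREE cancellation\nbefore 23:59 on 12 November 2021\n\n\n\n\n\n\n\n\n\n10% basic discount available\n\n\n\n\n\n\nMeals:\nHalf board is included in the room rate.\nBreakfast rated 7.6 - based on 38 reviews.\n\n\n\nCancellation:\n\nYou may cancel free of charge until 2 days before arrival. You will be charged the total price of the reservation if you cancel in the 2 days before arrival. If you don't show up, you will be charged the total price of the reservation.\n\n\nPrepayment:\nYou will be charged a prepayment of the total price at any time.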

To simplify, I would like the "Your choices" column to look like this:

  Your Choices

FREE cancellation
Good breakfast included
Breakfast & dinner included

How can I achieve this?

Answer 1

You can use sub to remove the first "\n" and everything after it:

library(tidyverse)
library(rvest)

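# `content` is the page parsed with read_html() in the question
tables <- content %>% html_table(fill = TRUE)
second_table <- tables[[2]]
# Delete the first "\n" and everything after it, keeping only the first line
second_table$`Your choices` <- sub('\n.*', '', second_table$`Your choices`)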

second_table$`Your choices`

# [1] "FREE cancellation"           "Good breakfast included"    
# [3] "Breakfast & dinner included" "All-Inclusive"              
# [5] "Breakfast & dinner included" "All-Inclusive"              
# [7] "FREE cancellation"           "Breakfast & dinner included"
# [9] "Good breakfast included"     "All-Inclusive"              
#[11] "Breakfast & dinner included" "All-Inclusive"              
#[13] "FREE cancellation"           "Good breakfast included"    
#[15] "Breakfast & dinner included" "All-Inclusive"              
#[17] "Breakfast & dinner included" "FREE cancellation"          
#[19] "Good breakfast included"     "All-Inclusive"              
#[21] "Breakfast & dinner included" "All-Inclusive"
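Since the live page may change or become unavailable, here is a minimal offline sketch of the same idea. The `choices` vector is hypothetical sample data standing in for second_table$`Your choices`, and the strsplit() variant is an alternative added for comparison, not part of the answer above.

```r
# Hypothetical sample data imitating the scraped "Your choices" column
choices <- c(
  "FREE cancellation\nbefore 23:59 on 12 November 2021\n\n\nMeals:\nNone",
  "Good breakfast included\n\n\nFREE cancellation\nbefore 23:59",
  "All-Inclusive"
)

# Approach from the answer: remove the first "\n" and everything after it
first_line_sub <- sub("\n.*", "", choices)

# Alternative sketch: split on the literal "\n" and keep the first piece
pieces <- strsplit(choices, "\n", fixed = TRUE)
first_line_split <- vapply(pieces, function(p) p[1], character(1),
                           USE.NAMES = FALSE)

print(first_line_sub)

# Both approaches should agree on this sample
stopifnot(identical(first_line_sub, first_line_split))
```

The strsplit() version uses a fixed (non-regex) match, so it does not depend on how the regex engine treats "." and newlines. Values without any "\n", such as "All-Inclusive", are returned unchanged by both approaches.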

r

web-scraping

rvest