Using rvest to webscrape multiple pages

I am trying to extract all of the speeches given by Melania Trump from 2016-2020 at the following link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush. I am trying to use rvest to do so. Here is my code so far:

library(rvest)

# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"

# main page
page <- read_html(link)

# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
  html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links

# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()

# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()

get_text <- function(title_link){
  speech_page = read_html(title_links)
  speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
  html_text()  %>% paste(collapse = ",")
  return(speech_page)
}

text = sapply(title_links, FUN = get_text)

I am having trouble with the following lines of code:

title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
  html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links

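Specifically, title_links produces a series of links that look like this: "https://www.presidency.ucsb.eduNA", rather than the individual web pages. Does anyone know what I am doing wrong here? Any help would be appreciated.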

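You are querying the wrong CSS node. The href attribute belongs to the a element inside each td.views-field-title cell, not to the td itself, so html_attr("href") on the td returns NA. Try: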

page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')


 [1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"                                            
 [2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"            
 [3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"          
 [4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
 [5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"
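
To make the difference concrete, here is a minimal offline sketch. The inline HTML, the names td_href, a_href, extract_speech_text, and the sample speech page are all hypothetical stand-ins, not the real site markup. It also assumes you want each paragraph joined by a newline instead of a comma. Note that the output above already shows absolute URLs, so pasting the domain in front would double it; xml2::url_absolute() is one way to handle both relative and absolute hrefs. Finally, the get_text() in the question reads title_links (the whole vector) instead of its argument and returns the parsed page instead of the text; the sketch corrects both.

```r
library(rvest)

# Hypothetical stand-in for the listing page, so the example runs offline.
listing <- minimal_html('
  <table>
    <tr><td class="views-field-title"><a href="/documents/speech-one">Speech One</a></td></tr>
    <tr><td class="views-field-title"><a href="https://www.presidency.ucsb.edu/documents/speech-two">Speech Two</a></td></tr>
  </table>')

# The <td> cells carry no href attribute, so html_attr() returns NA for each.
td_href <- listing %>% html_elements("td.views-field-title") %>% html_attr("href")

# The href lives on the <a> inside each cell.
a_href <- listing %>% html_elements("td.views-field-title a") %>% html_attr("href")

# url_absolute() resolves relative paths against the base and leaves
# already-absolute URLs unchanged, unlike paste().
title_links <- xml2::url_absolute(a_href, "https://www.presidency.ucsb.edu/")

# Pull the speech text out of an already-parsed page.
# Assumption: paragraphs are joined with newlines (the question used commas).
extract_speech_text <- function(speech_page) {
  speech_page %>%
    html_elements(".field-docs-content p") %>%
    html_text() %>%
    paste(collapse = "\n")
}

# Read ONE url (the function argument, not the whole vector) and return the text.
get_text <- function(title_link) {
  extract_speech_text(read_html(title_link))
}

# Hypothetical speech page used to check the extractor offline.
speech <- minimal_html(
  '<div class="field-docs-content"><p>First.</p><p>Second.</p></div>'
)
speech_text <- extract_speech_text(speech)

# With network access, the real run would look like:
# text <- vapply(title_links, get_text, character(1))
```

Using vapply() with character(1) instead of sapply() is a small design choice: it fails loudly if any page does not yield exactly one string, which can help when a request returns something unexpected.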