使用 rvest 抓取多个页面
Using rvest to webscrape multiple pages
我正在尝试提取 Melania Trump 从 2016-2020 年在以下 link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush 发表的所有演讲。我正在尝试使用 rvest
来这样做。到目前为止,这是我的代码:
# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
# main page
page <- read_html(link)
# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()
# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()
get_text <- function(title_link){
speech_page = read_html(title_links)
speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
html_text() %>% paste(collapse = ",")
return(speech_page)
}
text = sapply(title_links, FUN = get_text)
我在使用以下代码行时遇到问题:
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
特别是,title_links
会生成一系列 link,如下所示:"https://www.presidency.ucsb.eduNA"
,而不是单个网页。有谁知道我在这里做错了什么?任何帮助将不胜感激。
您正在查询错误的 css 节点。
尝试:
page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')
[1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"
[2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"
[3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"
[4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
[5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"
我正在尝试提取 Melania Trump 从 2016-2020 年在以下 link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush 发表的所有演讲。我正在尝试使用 rvest
来这样做。到目前为止,这是我的代码:
# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
# main page
page <- read_html(link)
# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()
# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()
get_text <- function(title_link){
speech_page = read_html(title_links)
speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
html_text() %>% paste(collapse = ",")
return(speech_page)
}
text = sapply(title_links, FUN = get_text)
我在使用以下代码行时遇到问题:
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
html_attr("href") %>% paste("https://www.presidency.ucsb.edu/",., sep="")
title_links
特别是,title_links
会生成一系列 link,如下所示:"https://www.presidency.ucsb.eduNA"
,而不是单个网页。有谁知道我在这里做错了什么?任何帮助将不胜感激。
您正在查询错误的 css 节点。 尝试:
page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')
[1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"
[2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"
[3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"
[4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
[5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"