Web Scraping URL links in R when links are within .dialog-off-canvas-main-canvas

Question

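Long-time lurker, first-time poster. I'm new to web scraping and R, and my code is mostly pieced together from Stack Overflow and YouTube, so I'm hoping someone can help with a problem I've run into. Many thanks.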

Recently I've been practicing scraping links. For the Union of Concerned Scientists blog posts this went well; see below. Apologies for any inefficiency, I'm new.

library(rvest)
library(dplyr)
library(readr)
library(stringr)

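# Accumulates the post URLs collected from each results page
UCS_blog_links = data.frame()

# Loop over the first three pages of the blog index
for(page_result in seq(from = 1, to = 3, by = 1)) {
  link = paste0("https://blog.ucsusa.org/page/", page_result)
  page = read_html(link)
  # Take the href attribute of every element with class "post-thumbnail"
  url_links = page %>% html_nodes(".post-thumbnail") %>%
    html_attr("href")
  # Append this page's links and drop duplicates
  UCS_blog_links = rbind(UCS_blog_links, data.frame(url_links, stringsAsFactors = FALSE)) %>%
    distinct()
  print(paste("Page:", page_result))
}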

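But when I try the same approach on the Union of Concerned Scientists Press Releases page, the links are not on the main page; they sit "behind" .dialog-off-canvas-main-canvas. So I'm wondering whether anyone has tips for modifying my code so that it first goes into the .dialog-off-canvas-main-canvas node and then scrapes the links, or whether a different approach is needed.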

Answer 1

We can get the links with:

url = 'https://www.ucsusa.org/about/news/press-releases' %>% read_html() %>% html_nodes('.view-content') %>% html_nodes('a') %>% html_attr('href')
url = unique(url)

 [1] "/about/news/experts-tell-epa-follow-science-protect-communities-ethylene-oxide"                      
 [2] "/about/news/new-sec-rule-vital-transparent-accounting-mounting-climate-risks-businesses-protecting-1"
 [3] "/about/news/union-concerned-scientists-applauds-repeal-trump-era-agency-action-scrapping-californias"
 [4] "/about/news/proposed-epa-truck-pollution-standard-falls-short-whats-needed-healthier-safer-future"
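
The hrefs above are relative paths, and the selector never mentions .dialog-off-canvas-main-canvas, because a CSS selector matches descendants at any depth once the page has been parsed. As a rough sketch only, the snippet below shows the same idea on a small inline HTML fragment. The fragment and its example-release paths are made up for illustration and are not the real page; url_absolute() from xml2 is used to turn the relative paths into full URLs.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the press-releases page (not the real markup)
html <- '<html><body>
  <div class="dialog-off-canvas-main-canvas">
    <nav><a href="/about">About</a></nav>
    <div class="view-content">
      <a href="/about/news/example-release-one">One</a>
      <a href="/about/news/example-release-two">Two</a>
      <a href="/about/news/example-release-one">One again</a>
    </div>
  </div>
</body></html>'

page <- read_html(html)

# One descendant selector: anchors inside .view-content, which itself sits
# inside the wrapper. The nav link is skipped because it is outside .view-content.
relative_links <- page %>%
  html_nodes(".dialog-off-canvas-main-canvas .view-content a") %>%
  html_attr("href") %>%
  unique()

# Resolve the relative paths against the site root
full_links <- url_absolute(relative_links, "https://www.ucsusa.org")
print(full_links)
```

Naming the wrapper in the selector is optional here; it only narrows the match if .view-content could also appear elsewhere on the page.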

