Web scraping URL links in R when the links are inside .dialog-off-canvas-main-canvas
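Long-time lurker, first-time poster. I'm new to web scraping and R, and my code is mostly pieced together from Stack Overflow and YouTube, so I'm hoping someone can help with a problem I've run into. Many thanks.
Lately I've been practicing scraping links. For the Union of Concerned Scientists blog posts this went well; see below (apologies for the inefficiency, I'm new).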
library(rvest)
library(dplyr)
library(readr)
library(stringr)

# Collect post links from the first three pages of the blog
UCS_blog_links = data.frame()
for (page_result in seq(from = 1, to = 3, by = 1)) {
  link = paste0("https://blog.ucsusa.org/page/", page_result)
  page = read_html(link)
  # Read the href attribute of every .post-thumbnail element
  url_links = page %>%
    html_nodes(".post-thumbnail") %>%
    html_attr("href")
  # Append this page's links and drop duplicates
  UCS_blog_links = rbind(UCS_blog_links, data.frame(url_links, stringsAsFactors = FALSE)) %>%
    distinct()
  print(paste("Page:", page_result))
}
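But when I try the same approach on the Union of Concerned Scientists Press Releases page, the links aren't in the main page; they sit "behind" .dialog-off-canvas-main-canvas. So I'm wondering whether anyone has tips for modifying my code so that it first goes into the .dialog-off-canvas-main-canvas node and then grabs the links, or whether a different approach is needed.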
We can get the links with:
url = 'https://www.ucsusa.org/about/news/press-releases' %>%
  read_html() %>%
  html_nodes('.view-content') %>%
  html_nodes('a') %>%
  html_attr('href')
url = unique(url)
url
[1] "/about/news/experts-tell-epa-follow-science-protect-communities-ethylene-oxide"
[2] "/about/news/new-sec-rule-vital-transparent-accounting-mounting-climate-risks-businesses-protecting-1"
[3] "/about/news/union-concerned-scientists-applauds-repeal-trump-era-agency-action-scrapping-californias"
[4] "/about/news/proposed-epa-truck-pollution-standard-falls-short-whats-needed-healthier-safer-future"
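The links returned above are relative paths, so they usually need to be resolved against the site before they can be passed back to read_html(). Here is a minimal sketch of that step, which also shows that a single descendant selector can reach links nested inside .dialog-off-canvas-main-canvas. The HTML fragment and the example-release paths are made up for illustration (the real page's markup may differ), and base_url, rel_links and abs_links are just names chosen here.

```r
library(rvest)
library(xml2)

# Hypothetical fragment standing in for the press-releases page;
# the structure and hrefs are assumptions, not the site's real markup.
page <- minimal_html('
  <div class="dialog-off-canvas-main-canvas">
    <div class="view-content">
      <a href="/about/news/example-release-one">One</a>
      <a href="/about/news/example-release-two">Two</a>
      <a href="/about/news/example-release-one">One again</a>
    </div>
  </div>
  <a href="/donate">Outside the wrapper</a>
')

# A descendant selector matches <a> tags anywhere inside the wrapper,
# so there is no need to step into the node first.
rel_links <- page %>%
  html_nodes(".dialog-off-canvas-main-canvas a") %>%
  html_attr("href") %>%
  unique()

# Resolve the relative paths against the URL the page was read from.
base_url <- "https://www.ucsusa.org/about/news/press-releases"
abs_links <- url_absolute(rel_links, base_url)
print(abs_links)
```

On the real site you would swap minimal_html(...) for read_html(base_url); which selector matches only the press-release links is something to check in the browser's inspector.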