Using the R pagedown package to extract webpages as PDFs without pop-ups and cookie warnings

A friend of mine has written 800+ articles on a food blog, and I would like to extract all of them to PDF so that I can have them nicely bound as a gift for him. There are far too many articles to use Chrome's "Save as PDF" manually, so I'm looking for the cleanest way to run a loop that saves the pages in that format. I have a working solution, but the final PDFs have ugly ads and a cookie-warning banner on every page, which I don't see when I manually select "Print" to PDF in Chrome. Is there a way to pass settings through pagedown to Chromium so that it prints without these elements? My code is pasted below, along with the website in question.

library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(downloader)

# Specify the URL of the first author page to be scraped

url1 <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', '1', '/')

# Read the HTML from the page
webpage1 <- read_html(url1)

# Pull the links for all articles on George's initial author page

dat <- html_attr(html_nodes(webpage1, 'a'), "href") %>%
  tibble(link = .) %>%                      # href attributes as a one-column tibble
  filter(str_detect(link, "[0-9]{4}")) %>%  # keep only hrefs containing a four-digit sequence
  distinct()

# Pull the links for all articles on George's 2nd-89th author page

for (i in 2:89) {

  url <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', i, '/')

  # Read the HTML for this page
  webpage <- read_html(url)

  links <- html_attr(html_nodes(webpage, 'a'), "href") %>%
    tibble(link = .) %>%
    filter(str_detect(link, "[0-9]{4}")) %>%
    distinct()

  dat <- bind_rows(dat, links) %>%
    distinct()
}

dat <- dat %>%
  arrange(link)

# Form a 1-link vector to test with

tocollect <- dat$link[1]

pagedown::chrome_print(input = tocollect,
                       wait = 20,
                       format = "pdf",
                       verbose = 0,
                       timeout = 300)
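
For the full run, I would then loop over every link along these lines (just a sketch; deriving each output name from the last segment of the URL, and the tryCatch wrapper, are my own untested additions):

# Sketch of the full loop; output names come from the URL's last path segment
for (l in dat$link) {
  slug <- basename(sub("/+$", "", l))
  tryCatch(
    pagedown::chrome_print(input = l,
                           output = paste0(slug, ".pdf"),
                           wait = 20,
                           timeout = 300),
    error = function(e) message("Failed on ", l, ": ", conditionMessage(e))
  )
}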

I would rather remove all of the unwanted elements from the page up front (especially the scripts, while keeping the stylesheets), save the result as a temporary HTML file, and then print that. The HTML files written this way look fine in a browser, but I was not able to test the printing step:

library(xml2)   # xml_remove(), xml_find_all(), write_html()

articleUrls <- dat$link   # the article links collected above

for (l in articleUrls) {
  a <- read_html(l)

  # drop scripts and structural chrome; the leading "//" makes the
  # XPath match anywhere in the document, not only at the top level
  xml_remove(xml_find_all(a, "//script"))
  xml_remove(xml_find_all(a, "//aside"))
  xml_remove(xml_find_all(a, "//footer"))

  # drop ads, banners, sign-up forms and other page furniture by class
  unwanted <- c("article-related mb20", "tags", "ad box",
                "newsletter-signup", "article-footer",
                "article-footer-sidebar", "site-footer",
                "sticky-newsletter", "site-header")
  for (cls in unwanted) {
    xml_remove(xml_find_all(a, sprintf("//*[contains(@class, '%s')]", cls)))
  }

  write_html(a, file = "currentArticle.html")

  # give each PDF its own name so successive articles are not overwritten
  slug <- basename(sub("/+$", "", l))
  pagedown::chrome_print(input = "currentArticle.html",
                         output = paste0(slug, ".pdf"))
}
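
If banners or ads still slip through after the node removal, a further option (untested; the class names in the selector are placeholders, not taken from the actual site) is to inject a small print stylesheet into the cleaned document before writing it out:

# Hypothetical extra step inside the loop, before write_html():
# hide any leftover banner/ad containers at print time
style <- xml_add_child(xml_find_first(a, "//head"), "style")
xml_text(style) <- "@media print { .cookie-banner, .ad-box { display: none !important; } }"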