Using the R pagedown package to extract webpages as PDFs without pop-ups and cookie warnings

A friend of mine has written 800+ articles on a food blog, and I would like to extract all of them to PDF so that I can have them nicely bound as a gift for him. There are far too many articles to use Chrome's "Save as PDF" manually, so I'm looking for the cleanest way to run a loop that saves the pages in that format. I have a working solution, but the final PDFs have ugly ads and a cookie-warning banner on every page, which I don't see when I manually select "Print" to PDF in Chrome. Is there a way to pass settings through pagedown to Chromium so that it prints without these elements? My code is pasted below, along with the website in question.

library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(downloader)

# Specify the URL of the first author page to be scraped

url1 <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', '1', '/')

# Read the HTML from the page
webpage1 <- read_html(url1)

# Pull the links for all articles on George's initial author page

dat <- html_attr(html_nodes(webpage1, 'a'), "href") %>%
  tibble(link = .) %>%                      # href attributes as a one-column tibble
  filter(str_detect(link, "[0-9]{4}")) %>%  # keep only hrefs containing a four-digit sequence
  distinct()

# Pull the links for all articles on George's 2nd-89th author page

for (i in 2:89) {

  url <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', i, '/')

  # Read the HTML for this page
  webpage <- read_html(url)

  links <- html_attr(html_nodes(webpage, 'a'), "href") %>%
    tibble(link = .) %>%
    filter(str_detect(link, "[0-9]{4}")) %>%
    distinct()

  dat <- bind_rows(dat, links) %>%
    distinct()
}

dat <- dat %>%
  arrange(link)

# Form a 1-link vector to test with

tocollect <- dat$link[1]

pagedown::chrome_print(input = tocollect,
                       wait = 20,
                       format = "pdf",
                       verbose = 0,
                       timeout = 300)
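
For the full run, I would then loop over every link along these lines (just a sketch; deriving each output name from the last segment of the URL, and the tryCatch wrapper, are my own untested additions):

# Sketch of the full loop; output names come from the URL's last path segment
for (l in dat$link) {
  slug <- basename(sub("/+$", "", l))
  tryCatch(
    pagedown::chrome_print(input = l,
                           output = paste0(slug, ".pdf"),
                           wait = 20,
                           timeout = 300),
    error = function(e) message("Failed on ", l, ": ", conditionMessage(e))
  )
}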

I would rather remove all of the unwanted elements from the page up front (especially the scripts, while keeping the stylesheets), save the result as a temporary HTML file, and then print that. The HTML files written this way look fine in a browser, but I was not able to test the printing step:

library(xml2)   # xml_remove(), xml_find_all(), write_html()

articleUrls <- dat$link   # the article links collected above

for (l in articleUrls) {
  a <- read_html(l)

  # drop scripts and structural chrome; the leading "//" makes the
  # XPath match anywhere in the document, not only at the top level
  xml_remove(xml_find_all(a, "//script"))
  xml_remove(xml_find_all(a, "//aside"))
  xml_remove(xml_find_all(a, "//footer"))

  # drop ads, banners, sign-up forms and other page furniture by class
  unwanted <- c("article-related mb20", "tags", "ad box",
                "newsletter-signup", "article-footer",
                "article-footer-sidebar", "site-footer",
                "sticky-newsletter", "site-header")
  for (cls in unwanted) {
    xml_remove(xml_find_all(a, sprintf("//*[contains(@class, '%s')]", cls)))
  }

  write_html(a, file = "currentArticle.html")

  # give each PDF its own name so successive articles are not overwritten
  slug <- basename(sub("/+$", "", l))
  pagedown::chrome_print(input = "currentArticle.html",
                         output = paste0(slug, ".pdf"))
}
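
If banners or ads still slip through after the node removal, a further option (untested; the class names in the selector are placeholders, not taken from the actual site) is to inject a small print stylesheet into the cleaned document before writing it out:

# Hypothetical extra step inside the loop, before write_html():
# hide any leftover banner/ad containers at print time
style <- xml_add_child(xml_find_first(a, "//head"), "style")
xml_text(style) <- "@media print { .cookie-banner, .ad-box { display: none !important; } }"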