Web scraping in R: extracting urls from a website with lazy-loading pages

I am trying to extract the urls from the following website. The tricky part is that the site loads new pages automatically as you scroll. I have not managed to write an xpath that scrapes all the urls, including those on the lazily loaded pages; I only get the first 15 of the 70+ urls. I assume the xpath in the last line (new_results ...) is missing some key element that would account for the later pages. Any ideas? Thanks!

# load packages
library(rvest)
library(httr)
library(RCurl)
library(XML)
library(stringr)
library(xml2)


# aim: download all speeches stored at:
# https://sheikhmohammed.ae/en-us/Speeches

# first, create a vector that will store the url of each single speech
all_links   <- character()
new_results <- "/en-us/Speeches"

# CA certificate bundle shipped with RCurl, used for the https requests
signatures <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
options(RCurlOptions = list(verbose = FALSE,
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
                            ssl.verifypeer = FALSE))

# follow the pages until no reference to a next batch of speeches is found
while (length(new_results) > 0) {
  new_results  <- str_c("https://sheikhmohammed.ae", new_results)
  results      <- getURL(new_results, cainfo = signatures)
  results_tree <- htmlParse(results)
  # collect the data-url attribute of every speech block on the current page
  all_links <- c(all_links,
                 xpathSApply(results_tree, "//div[@class='speech-share-board']",
                             xmlGetAttr, "data-url"))
  # look for references to the next (lazily loaded) batch of speeches
  new_results <- xpathSApply(results_tree, "//div[@class='speech-share-board']//after",
                             xmlGetAttr, "data-url")
}

# or, alternatively, with phantomjs (this too only loads the first 15 urls):
url <- "https://sheikhmohammed.ae/en-us/Speeches#"

# write out a script that phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
  console.log(page.content); // page source
  phantom.exit();
});", url), con = "scrape.js")

# process it with phantomjs
write(readLines(pipe("phantomjs scrape.js", "r")), "scrape.html")
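
# For completeness, once phantomjs has dumped the rendered source into
# scrape.html, the same data-url attributes can be pulled out with rvest.
# A minimal sketch; like the getURL approach above, it still only sees the
# first 15 speeches unless the page is scrolled before the dump:
library(rvest)

page  <- read_html("scrape.html")
links <- page %>%
  html_nodes("div.speech-share-board") %>%
  html_attr("data-url")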

Running the JavaScript lazy loading in RSelenium, or in Selenium from Python, would be the most elegant way to solve the problem. As a less elegant but faster alternative, however, you can manually change the settings of the json query in the Firefox developer mode/network tab so that it loads not just 15 but more (= all) speeches at once. That worked well for me, and I was able to extract all the links via the json response.
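
A rough sketch of the RSelenium route (it assumes a working local Selenium/driver setup; the loop simply keeps scrolling to the bottom so the lazy loader appends the remaining speeches before the page is handed to rvest):

# load packages
library(RSelenium)
library(rvest)

# start a browser session (assumes Selenium and a Firefox driver are installed)
rD    <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://sheikhmohammed.ae/en-us/Speeches")

# keep scrolling to the bottom until no additional speech blocks appear
repeat {
  n_before <- length(remDr$findElements("css selector", "div.speech-share-board"))
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give the lazy loader time to append the next batch
  n_after <- length(remDr$findElements("css selector", "div.speech-share-board"))
  if (n_after == n_before) break
}

# hand the fully rendered page to rvest and extract the data-url attributes
page <- read_html(remDr$getPageSource()[[1]])
all_links <- page %>%
  html_nodes("div.speech-share-board") %>%
  html_attr("data-url")

remDr$close()
rD$server$stop()

And a bare-bones illustration of the json route: the request url and the paging parameters below are placeholders, not the site's actual API; copy the real request from the Firefox network tab and raise its page size there:

# load packages
library(httr)
library(jsonlite)

# placeholder endpoint and parameters -- replace them with the request the
# network tab shows when the site fetches the next batch of speeches
resp <- GET("https://sheikhmohammed.ae/en-us/Speeches",    # placeholder url
            query = list(page = 1, pageSize = 100))        # placeholder paging parameters
speeches <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))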