Scraping several webpages from a website (newspaper archive) using RSelenium

Question

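I managed to scrape one page from the newspaper archive by following existing explanations.

Now I am trying to automate visiting a list of pages by running a single piece of code. Building the list of URLs is easy, because the newspaper's archive uses a consistent link pattern: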

https://en.trend.az/archive/2021-XX-XX

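The problem is writing a loop that scrapes data such as headline, date, time, and category. For simplicity, I tried to work only with the article headlines from 2021-09-30 to 2021-10-02.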

    ## Setting data frames

    d1 <- as.Date("2021-09-30")
    d2 <- as.Date("2021-10-02")

    list_of_url <- character()   # or str_c()

    ## Generating subpage list

    for (i in format(seq(d1, d2, by="days"), format="%Y-%m-%d"))  {
      list_of_url[i] <- str_c ("https://en.trend.az", "/archive/", i)

      # Launching browser

      driver <- rsDriver(browser = c("firefox"))  #Version 93.0 (64-bit)
      remDr <- driver[["client"]]
      remDr$errorDetails
      remDr$navigate(list_of_url[i])

      remDr0$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()

      webElem <- remDr$findElement("css", "body")
      #scrolling to the end of webpage, to load all articles
      for (i in 1:25){
        Sys.sleep(2)
        webElem$sendKeysToElement(list(key = "end"))
      }

      page <- read_html(remDr$getPageSource()[[1]])

      # Scraping article headlines

      get_headline <- page %>%
        html_nodes('.category-article') %>% html_nodes('.article-title') %>%
        html_text()
      get_time <- str_sub(get_time, start= -5)

      length(get_time)
      }
    }

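The total length should be 157+166+140=463. In fact, I did not even manage to collect all the data from a single page (length(get_time) = 126).

I think that after the first set of commands in the loop I obtained three remDr objects for the 3 specified dates, but they were not recognized independently afterwards.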

So I tried to start a second loop inside the initial loop, before or after `page <-`, with

    for (remDr0 in remDr) {
    page <- read_html(remDr0$getPageSource()[[1]])
    # substituted all remDr-s below with remDr0

or

    page <- read_html(remDr$getPageSource()[[1]])
    for (page0 in page)
    # substituted all page-s below with page0

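However, these attempts ended with different errors.

I would greatly appreciate expert help, as this is my first time using R for this kind of task.

I hope the existing loop can be corrected, or that someone can suggest a shorter route, for example writing a function.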

Answer 1

Slightly expanded to scrape multiple categories.

    library(RSelenium)
    library(dplyr)
    library(rvest)
    library(stringr)  # provides str_sub(), used in get_time()

Specify the date range

    d1 <- as.Date("2021-09-30")
    d2 <- as.Date("2021-10-02")
    dt = seq(d1, d2, by="days")#contains the date sequence
    
    #launch browser 
    driver <- rsDriver(browser = c("firefox"))  
    remDr <- driver[["client"]]
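
As a small illustration of how the date sequence maps onto the archive links, the sketch below builds the three URLs up front. The functions in this answer paste the date directly, which relies on R converting a `Date` to text implicitly; calling `format()` makes that step explicit. The name `archive_urls` is introduced here for illustration only.

```r
# Date range taken from the question
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
dt <- seq(d1, d2, by = "days")

# format() makes the Date -> "YYYY-MM-DD" conversion explicit
archive_urls <- paste0("https://en.trend.az/archive/", format(dt, "%Y-%m-%d"))
print(archive_urls)
```

Because `paste0()` is vectorised, no loop is needed to build the list.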
    
### `get_headline`  Function for newspaper headlines 

    get_headline = function(x){
      link = paste0( 'https://en.trend.az/archive/', x)
      remDr$navigate(link)
      remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
      webElem <- remDr$findElement("css", "body")
      #scrolling to the end of webpage, to load all articles 
      for (i in 1:25){
        Sys.sleep(1)
        webElem$sendKeysToElement(list(key = "end"))
      } 
      
      headlines = remDr$getPageSource()[[1]] %>% 
        read_html() %>%
        html_nodes('.category-article') %>% html_nodes('.article-title') %>% 
        html_text()
      headlines 
      return(headlines)
    }
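
The fixed 25 scroll rounds may not be enough on days with many articles (the question reports 126 items where 157 were expected). One possible alternative, sketched below, is to keep scrolling until the page height stops changing. `scroll_until_stable`, `scroll_once`, and `get_height` are hypothetical names, not part of RSelenium; the commented wiring assumes that `remDr$executeScript()` returns the script result as the first element of a list.

```r
# Hypothetical helper: scroll until the page height stops growing.
# `scroll_once` and `get_height` are functions supplied by the caller.
scroll_until_stable <- function(scroll_once, get_height,
                                max_rounds = 50, pause = 1) {
  last_height <- get_height()
  for (round in seq_len(max_rounds)) {
    scroll_once()
    Sys.sleep(pause)                 # give the page time to load more items
    new_height <- get_height()
    if (new_height == last_height) {
      return(round)                  # rounds used, including the unchanged one
    }
    last_height <- new_height
  }
  max_rounds                         # safety cap reached
}

# Possible wiring with this answer's `remDr` and `webElem` (not run here):
# scroll_until_stable(
#   scroll_once = function() webElem$sendKeysToElement(list(key = "end")),
#   get_height  = function() remDr$executeScript(
#     "return document.body.scrollHeight;")[[1]]
# )
```

A single unchanged height can also mean the site was merely slow to respond, so a longer `pause` may be needed in practice.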

### `get_time` Function for publication times

    get_time <- function(x){
      link = paste0( 'https://en.trend.az/archive/', x)
      remDr$navigate(link)
      remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
      webElem <- remDr$findElement("css", "body")
      #scrolling to the end of webpage, to load all articles 
      for (i in 1:25){
        Sys.sleep(1)
        webElem$sendKeysToElement(list(key = "end"))
      } 
      
      # Addressing selector of time on the website
      
      time <- remDr$getPageSource()[[1]] %>%
        read_html() %>%
        html_nodes('.category-article') %>% html_nodes('.article-date') %>% 
        html_text() %>%
        str_sub(start= -5)   # keep only the last five characters
      return(time)
    }
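
To show what `str_sub(start = -5)` does, here is a minimal sketch on made-up strings. The sample date texts are hypothetical; the real text inside `.article-date` may be formatted differently, so the assumption that the last five characters hold the time should be checked against the live page.

```r
library(stringr)

# Hypothetical date strings; the real site text may be formatted differently
sample_dates <- c("30 September 2021 23:58", "1 October 2021 09:05")

# A negative start counts from the end, so -5 keeps the last five characters
times <- str_sub(sample_dates, start = -5)
print(times)

# Trimming first guards against trailing whitespace in scraped text
times_trimmed <- str_sub(str_trim(" 2 October 2021 14:30\n"), start = -5)
print(times_trimmed)
```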

### `get_number` Function for numbering all articles on one page (day)

    get_number <- function(x){
      link = paste0( 'https://en.trend.az/archive/', x)
      remDr$navigate(link)
      remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
      webElem <- remDr$findElement("css", "body")
      #scrolling to the end of webpage, to load all articles 
      for (i in 1:25){
        Sys.sleep(1)
        webElem$sendKeysToElement(list(key = "end"))
      } 
      
      # Addressing selectors of headlines on the website
      
      headline <- remDr$getPageSource()[[1]] %>% 
        read_html() %>%
        html_nodes('.category-article') %>% html_nodes('.article-title') %>% 
        html_text()
      # seq_along() returns an empty vector when no headlines were found
      number <- seq_along(headline)
      return(number)
    }

Combine all functions into a `tibble`

    get_data_table <- function(x){

      # Extract the basic information from the HTML
      headline <- get_headline(x)
      time <- get_time(x)
      headline_number <- get_number(x)

      # Combine into a tibble
      combined_data <- tibble(Num = headline_number,
                              Article = headline,
                              Time = time)
      return(combined_data)
    }
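
Each of the three helper functions navigates to the page and scrolls again, so every date is loaded three times. A possible alternative, sketched here under assumptions, is to load the page once and extract all columns from the same page source. `parse_archive_page` and `sample_html` are hypothetical; only the CSS classes come from the code in this answer, and the sample markup is invented to make the sketch runnable without a browser.

```r
library(rvest)
library(stringr)
library(tibble)

# Hypothetical helper: build one tibble from an already loaded page source
parse_archive_page <- function(page_source) {
  articles <- read_html(page_source) %>%
    html_nodes(".category-article")

  headline <- articles %>%
    html_nodes(".article-title") %>%
    html_text(trim = TRUE)

  time <- articles %>%
    html_nodes(".article-date") %>%
    html_text(trim = TRUE) %>%
    str_sub(start = -5)

  tibble(Num = seq_along(headline), Article = headline, Time = time)
}

# Invented markup that only mimics the class names used in this answer
sample_html <- '
<html><body>
  <div class="category-article">
    <div class="article-title">First headline</div>
    <div class="article-date">30 September 2021 23:58</div>
  </div>
  <div class="category-article">
    <div class="article-title">Second headline</div>
    <div class="article-date">30 September 2021 21:10</div>
  </div>
</body></html>'

parsed <- parse_archive_page(sample_html)
print(parsed)
```

With a live session this would be called as `parse_archive_page(remDr$getPageSource()[[1]])` after navigating and scrolling once. If an article lacks a title or a date, the two vectors differ in length and `tibble()` will raise an error, which at least makes the mismatch visible.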

Use `lapply` to iterate over all dates in `dt`

    df = lapply(dt, get_data_table)
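
`lapply` returns a list with one tibble per date. If a single table is more convenient, one option is `dplyr::bind_rows()` with a date column. The sketch below uses toy tibbles in place of real scraped results, and `all_articles` is a name introduced for illustration.

```r
library(dplyr)

# Toy stand-ins for the per-day tibbles returned by get_data_table()
dt <- seq(as.Date("2021-09-30"), as.Date("2021-10-02"), by = "days")
df <- list(
  tibble(Num = 1:2, Article = c("A", "B"), Time = c("23:58", "21:10")),
  tibble(Num = 1L, Article = "C", Time = "09:05"),
  tibble(Num = 1:3, Article = c("D", "E", "F"),
         Time = c("18:00", "12:30", "08:15"))
)

# Name each element by its date so bind_rows() can record it in a column
names(df) <- format(dt, "%Y-%m-%d")
all_articles <- bind_rows(df, .id = "Date")
print(all_articles)
```

On real data, `nrow(all_articles)` can then be compared with the 157+166+140=463 articles expected in the question.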
