Scraping a web table through multiple pages (some rows are missing)

Question

I want to use rvest to scrape the table at https://irelandsgreatwardead.ie/the-archive/ (it contains information on 31,385 soldiers).

library(rvest)
library(dplyr)

page <- read_html(x = "https://irelandsgreatwardead.ie/the-archive/")    
table <- page             %>% 
  html_nodes("table")     %>%  
  html_table(fill = TRUE) %>%
  as.data.frame()

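This works, but only for the first 10 soldiers. In the page source I can also only see the information for the first 10 soldiers. Any help on how to get the rows for the other soldiers would be much appreciated!

Thanks, and have a nice day!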

Answer 1

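Here is an RSelenium solution.

You can loop through the pages, extract the table from each one, and append it to the table from the previous pages.

First, launch the browser: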

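library(RSelenium)
library(rvest)  # read_html(), html_table() and the %>% pipe used below

url <- "https://irelandsgreatwardead.ie/the-archive/"
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)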

Part 1: extract the table from the first page and store it in df:

df = remDr$getPageSource()[[1]] %>% 
  read_html() %>%
  html_table() 
df = df[[1]]
# Remove the last row, which is non-essential
df = df[-nrow(df),]

Part 2: loop through pages 2 to 5

for (i in 2:5) {
  # Build the XPath of the pagination link for page i
  xp = paste0('//*[@id="table_1_paginate"]/span/a[', i, ']')
  cc <- remDr$findElement(using = 'xpath', value = xp)
  cc$clickElement()

  # Wait three seconds for the page to load
  Sys.sleep(3)
  df1 = remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  df1 = df1[[1]]
  df1 = df1[-nrow(df1), ]

  # Append the current table `df1` to the accumulated table `df`
  df = rbind(df, df1)
}
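
The read-and-trim step is repeated verbatim in Part 1 and Part 2, so it could be factored into a helper. The following is only a minimal sketch: `extract_table` is a hypothetical name, not part of rvest or RSelenium, and the HTML string is a made-up stand-in for `remDr$getPageSource()[[1]]` so the example runs without a browser. It also collects the pages in a list and binds them once, instead of growing `df` with `rbind` inside the loop.

```r
library(rvest)

# Hypothetical helper: parse a page source string, take the first table,
# and drop its last row (the answer treats that row as non-essential).
extract_table <- function(page_source) {
  tables <- html_table(read_html(page_source))
  first <- tables[[1]]
  first[-nrow(first), ]
}

# Made-up stand-in for remDr$getPageSource()[[1]]; not data from the real site.
sample_source <- "<html><body><table>
  <tr><th>Name</th><th>Rank</th></tr>
  <tr><td>A</td><td>Private</td></tr>
  <tr><td>B</td><td>Sergeant</td></tr>
  <tr><td>Name</td><td>Rank</td></tr>
</table></body></html>"

# Collect one data frame per page, then bind them in a single call.
pages <- list(extract_table(sample_source), extract_table(sample_source))
combined <- do.call(rbind, pages)
```

With a live session, each loop iteration would call `extract_table(remDr$getPageSource()[[1]])` after the click and the pause.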

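Part 3: loop through the remaining pages, 6 to 628

For the remaining pages the XPath stays the same, so we repeat this block of code 623 times to get the tables from the remaining pages.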

for (i in 1:623) {
  # From page 6 onwards the link to the next page is always the 4th anchor
  cc <- remDr$findElement(using = 'xpath', value = '//*[@id="table_1_paginate"]/span/a[4]')
  cc$clickElement()

  # Wait three seconds for the page to load
  Sys.sleep(3)
  df1 = remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  df1 = df1[[1]]
  df1 = df1[-nrow(df1), ]
  df = rbind(df, df1)
}

Now df holds the information for all the soldiers.
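
The loops in Part 2 and Part 3 differ only in which pagination link they click, so the rule can be written down once. This is a sketch under the assumption that the rule described in this answer holds for every page (link `a[i]` for pages 2 to 5, link `a[4]` from page 6 onwards); `pagination_xpath` is a hypothetical helper name.

```r
# Hypothetical helper: XPath of the pagination link to click in order to
# reach `page`, following the rule stated in this answer.
pagination_xpath <- function(page) {
  stopifnot(page >= 2)
  idx <- if (page <= 5) page else 4
  paste0('//*[@id="table_1_paginate"]/span/a[', idx, ']')
}

# XPaths for pages 2 through 8
xpaths <- vapply(2:8, pagination_xpath, character(1))
```

A single loop over `2:628` could then pass `pagination_xpath(i)` to `remDr$findElement(using = 'xpath', value = ...)`.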

Answer 2

library(RSelenium)
library(rvest)  # read_html(), html_table() and the %>% pipe used below
driver = rsDriver(browser = c("firefox"))

remDr <- driver[["client"]]
url <- 'https://irelandsgreatwardead.ie/the-archive/'
remDr$navigate(url)

# Locate the next page link
webElem <- remDr$findElement(using = "css", value = "a[data-dt-idx='3']")

# Click that link
webElem$clickElement()

# Get that table
remDr$getPageSource()[[1]] %>% 
  read_html() %>%
  html_table()

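Your for loop needs to start at the value 3 (that is, the link to the second page!). On the second page it becomes 4, and so on, but it never goes above 5. Because of the way the pagination is 'designed', you loop over 3:5 and then keep using 5 every time after that.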
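
The index rule described above can be sketched as a small function. It assumes the behaviour stated in this answer (the next-page link has `data-dt-idx` 3 on the first page, then 4, then stays at 5); `dt_idx_for_click` and `next_link_selector` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: data-dt-idx of the link to click for the k-th click
# (the k-th click moves from page k to page k + 1). Starts at 3, caps at 5.
dt_idx_for_click <- function(click) {
  min(as.integer(click) + 2L, 5L)
}

# Build the CSS selector for that link, in the form used in the code above.
next_link_selector <- function(click) {
  sprintf("a[data-dt-idx='%d']", dt_idx_for_click(click))
}

# Indices used for the first five clicks
first_indices <- vapply(1:5, dt_idx_for_click, integer(1))
```

Inside a loop, `next_link_selector(k)` would be passed as `value` to `remDr$findElement(using = "css", ...)`.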
