Scrolling through an entire page with RSelenium, then extracting tabular data into a data frame

I am currently trying to scrape a website using a combination of RSelenium, rvest, and tidyverse.

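The goal is to go to this website, click on one of the links (for example, "Promo"), and then extract the entire table of data (for example, the cards and their graded prices) using rvest.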

I am able to extract the table without any problems using the following code:

library(RSelenium)
library(rvest)
library(tidyverse)

pokemon <- read_html("https://www.pricecharting.com/console/pokemon-promo")

price_table <- pokemon %>% 
  html_elements("#games_table") %>% 
  html_table()

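However, this has a couple of problems: 1) I can't navigate through all the different card sets listed on the initial page (https://www.pricecharting.com/category/pokemon-cards), and 2) I can't extract the entire table with this method, only the rows that load initially.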

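To get around these problems, I looked into RSelenium. What I decided to do is go to the initial page, click the link to a card set (e.g., "Promo"), and then load the entire page. That workflow is shown here: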

## open driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]

## navigate to primary page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")

## click on the link I want
remDr$findElement(using = "link text", "Promo")$clickElement()

## find the table
table <- remDr$findElement(using = "id", "games_table")

## load the entire table
table$sendKeysToElement(list(key = "end"))

## get the entire source
full_table <- remDr$getPageSource()[[1]]

## read in the table
html_page <- read_html(full_table)


## Do the `rvest` technique I had above.
html_page %>% 
  html_elements("#games_table") %>% 
  html_table()

However, my problem is that I again get the same 51 elements rather than the entire table.

I would like to know whether it is possible to combine my two techniques, and where my code went wrong.

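I solved the problem. Two things were going on. First, the page loads with the cursor already placed in the search bar, so key presses never reach the page itself. I got around this by clicking on the page body with remDr$findElement(using = "css", "body")$clickElement(). Second, as was pointed out, if scrolling with the arrow keys does not work via sendKeysToElement(list(key = "up_arrow")), you should try remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);").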
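Scrolling a fixed number of times is a guess at how many batches of rows the page will load. One alternative is to keep scrolling until the page height stops growing. The following is only a minimal sketch of that idea: `scroll_to_bottom`, `pause`, and `max_scrolls` are hypothetical names of my own (not part of RSelenium), and it assumes `remDr` is an open RSelenium client whose `executeScript()` returns the script's value as the first element of a list.

```r
# Hypothetical helper (not part of RSelenium): scroll until the document
# height stops changing, or until max_scrolls attempts have been made.
scroll_to_bottom <- function(remDr, pause = 1, max_scrolls = 50) {
  # Ask the browser for the current height of the page body
  get_height <- function() {
    remDr$executeScript("return document.body.scrollHeight;")[[1]]
  }

  last_height <- get_height()
  for (i in seq_len(max_scrolls)) {
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(pause)  # give the page time to append more rows
    new_height <- get_height()
    if (new_height == last_height) break  # nothing new loaded, assume done
    last_height <- new_height
  }
  invisible(last_height)
}
```

It would be called as `scroll_to_bottom(remDr)` right before `remDr$getPageSource()`. If the site loads slowly, a pause that is too short could make the loop stop early, so the pause length is something to tune.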

A small example of my script is as follows:

library(RSelenium)
library(rvest)
library(tidyverse)

## opens the driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]

link_texts <- c("Base Set", "Promo", "Fossil")
## navigates to the correct page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")

for (name in link_texts) {
  ## finds the link and clicks on it
  remDr$findElement(using = "link text", name)$clickElement()
  ## clicks on the page body to move focus out of the search bar
  remDr$findElement(using = "css", "body")$clickElement()
  ## scrolls to the bottom of the page six times, pausing so more rows can load
  for (i in 1:6) {
    remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
    Sys.sleep(1)
  }
  ## get the entire page source that's been loaded
  html <- remDr$getPageSource()[[1]]
  ## read in the page source
  page <- read_html(html)
  
  ## build an object name such as "base_set"; str_replace_all handles names with several spaces
  data_name <- str_to_lower(str_replace_all(name, " ", "_"))
  ## extract the tabular table
  df <- page %>% 
    html_elements("#games_table") %>% 
    html_table() %>% 
    pluck(1) %>% 
    select(1:4)
  assign(data_name, df)
  Sys.sleep(3)
  remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
}

## close driver
remDr$close()
rD$server$stop()
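The loop above uses assign() to create one data frame per card set in the global environment. An alternative is to collect the tables in a named list, which is easier to iterate over and to combine afterwards. This is a sketch only: the `scraped` list holds made-up stand-in tables so the example runs without a browser, and `results` and `all_sets` are names I chose for illustration.

```r
library(dplyr)
library(stringr)

# Made-up stand-ins for the tables scraped from each card set
scraped <- list(
  "Base Set" = tibble(Card = c("Card A", "Card B"), Ungraded = c("$1.00", "$2.00")),
  "Promo"    = tibble(Card = "Card C", Ungraded = "$3.00")
)

# Store each table in a named list instead of calling assign()
results <- list()
for (name in names(scraped)) {
  data_name <- str_to_lower(str_replace_all(name, " ", "_"))
  results[[data_name]] <- scraped[[name]]
}

# Stack everything into one data frame, keeping the set name in a column
all_sets <- bind_rows(results, .id = "set")
```

In the real loop, `scraped[[name]]` would be replaced by the `df` built from the page source, and `results$promo` would then hold the Promo table.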

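The page does not scroll down because, by default, the cursor is in the search bar. So I made a few modifications to your code so that it scrolls all the way down.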

#Launch browser
rD <- rsDriver(browser="firefox", port=9545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
remDr$findElement(using = "link text", "Promo")$clickElement()

#clicking outside the search bar
remDr$findElement(using = "xpath", value = '//*[@id="console-page"]')$clickElement()

webElem <- remDr$findElement("css", "body")
#looping to get at the end of the page. 
for (i in 1:25){
  Sys.sleep(1)
  webElem$sendKeysToElement(list(key = "end"))
} 

#extract table
full_table <- remDr$getPageSource()[[1]]
html_page <- read_html(full_table)
html_page %>% 
  html_elements("#games_table") %>% 
  html_table()
[[1]]
# A tibble: 888 x 5
   Card                 Ungraded `Grade 9` `PSA 10`  ``                                                                                                        
   <chr>                <chr>    <chr>     <chr>     <chr>                                                                                                     
 1 Mew #8               .99    .79    .62    "+ Collection\n                                        In One Click\n                                    ~
 2 Mewtwo #3            .28    .91    7.50   "+ Collection\n                                        In One Click\n                                    ~
 3 Charizard GX #SM211  .85    .64    .50    "+ Collection\n                                        In One Click\n                                    ~
 4 Charizard V #SWSH050 .00    .99    .98    "+ Collection\n                                        In One Click\n                                    ~
 5 Pikachu #24          8.31  2.72   ,919.69 "+ Collection\n                                        In One Click\n                                    ~
 6 Entei #34            .50    .21    3.63   "+ Collection\n                                        In One Click\n                                    ~
 7 Ancient Mew          .79   .99    2.50   "+ Collection\n                                        In One Click\n                                    ~
 8 Charizard EX #XY121  .16   5.00   7.00   "+ Collection\n                                        In One Click\n                                    ~
 9 Mewtwo EX #XY107     .54    .50    .71    "+ Collection\n                                        In One Click\n                                    ~
10 Charizard GX #SM60   .57   3.98   2.00   "+ Collection\n                                        In One Click\n                                    ~
# ... with 878 more rows
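In the output above, the price columns come back as character and the unnamed fifth column only holds button text. A possible clean-up step, sketched under assumptions: the rows below are made up purely to mimic the table's shape (they are not real prices), the fifth column is called `extra` here because the real one has an empty name, and `clean_prices` is a name chosen for illustration.

```r
library(dplyr)
library(readr)

# Made-up rows in the same shape as the scraped table (illustrative values only)
price_table <- tibble(
  Card      = c("Card A", "Card B"),
  Ungraded  = c("$1.50", "$1,234.00"),
  `Grade 9` = c("$10.00", "$2,000.50"),
  `PSA 10`  = c("$25.25", "$3,100.00"),
  extra     = c("+ Collection", "+ Collection")
)

clean_prices <- price_table %>%
  select(1:4) %>%                       # drop the button-text column
  mutate(across(-Card, parse_number))   # "$1,234.00" becomes 1234
```

Selecting by position mirrors the `select(1:4)` used earlier in the thread, so it does not depend on the name of the fifth column.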