Scrolling through an entire page with RSelenium, then extracting tabular data into a data frame
I am currently trying to scrape a website using a combination of RSelenium, rvest, and tidyverse.
The goal is to go to this website, click on one of the links (e.g., "Promo"), and then extract the entire table of data (e.g., cards and graded prices) using rvest.
I am able to extract the table without issue using the following code:
library(RSelenium)
library(rvest)
library(tidyverse)
pokemon <- read_html("https://www.pricecharting.com/console/pokemon-promo")
price_table <- pokemon %>%
  html_elements("#games_table") %>%
  html_table()
However, this has a couple of problems: 1) I can't navigate through all the different card sets from the initial website link I provided (https://www.pricecharting.com/category/pokemon-cards), and 2) I can't extract the entire table with this method - only what loads initially.
To address these problems, I have been looking into RSelenium. What I decided to do was go to the initial website, click the link to a card set (e.g., "Promo"), and then load the entire page. That workflow is shown here:
## open driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
## navigate to primary page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
## click on the link I want
remDr$findElement(using = "link text", "Promo")$clickElement()
## find the table
table <- remDr$findElement(using = "id", "games_table")
## load the entire table
table$sendKeysToElement(list(key = "end"))
## get the entire source
full_table <- remDr$getPageSource()[[1]]
## read in the table
html_page <- read_html(full_table)
## Do the `rvest` technique I had above.
html_page %>%
  html_elements("#games_table") %>%
  html_table()
However, my problem is that I again get the same 51 elements rather than the entire table.
I'm wondering whether it's possible to combine my two techniques, and where I'm going wrong in my code.
I solved this problem. There were two things going on. The first is that the page loads with the cursor inside the search bar. I got rid of that by clicking on the body of the page with remDr$findElement(using = "css", "body")$clickElement(). Next, as was pointed out, if scrolling/arrow keys via sendKeysToElement(list(key = "up_arrow")) don't work, you should try remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);").
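As a side note on the scrolling step: rather than firing a fixed number of scrolls, one option is to keep scrolling until the page height stops growing. This is only a sketch, not tested against this site, and it assumes a `remDr` client is already open on the target page:

```r
## Sketch: scroll until the page stops growing (assumes an open `remDr` client)
old_height <- 0
new_height <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
while (new_height > old_height) {
  old_height <- new_height
  remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
  Sys.sleep(1)  ## give the lazy-loaded rows time to render
  new_height <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
}
```

The `Sys.sleep(1)` pause is a guess at how long the site needs to append rows; a slow connection may need a longer wait.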
So, a small example from my script is as follows:
library(RSelenium)
library(rvest)
library(tidyverse)
## opens the driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
link_texts <- c("Base Set", "Promo", "Fossil")
## navigates to the correct page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
for (name in link_texts) {
  ## find the link and click on it
  remDr$findElement(using = "link text", name)$clickElement()
  ## click on the body so the cursor leaves the search bar
  remDr$findElement(using = "css", "body")$clickElement()
  ## find the table - this line may be extraneous
  table <- remDr$findElement(using = "css", "body")
  ## scroll to the bottom of the page repeatedly so the full table loads
  for (i in 1:6) {
    remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
    Sys.sleep(1)
  }
  ## get the entire page source that's been loaded
  html <- remDr$getPageSource()[[1]]
  ## read in the page source
  page <- read_html(html)
  data_name <- str_to_lower(str_replace(name, " ", "_"))
  ## extract the table
  df <- page %>%
    html_elements("#games_table") %>%
    html_table() %>%
    pluck(1) %>%
    select(1:4)
  assign(data_name, df)
  Sys.sleep(3)
  remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
}
## close driver
remDr$close()
rD$server$stop()
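An aside on the `assign()` step: collecting the results in a named list keeps the tables together and avoids writing into the global environment. A minimal illustration with toy data frames standing in for the scraped tables (no scraping involved):

```r
library(stringr)

link_texts <- c("Base Set", "Promo", "Fossil")

## toy stand-ins for the three scraped tables
tables <- list(data.frame(Card = "A"), data.frame(Card = "B"), data.frame(Card = "C"))

## name each table from its link text, e.g. "Base Set" -> "base_set"
names(tables) <- str_to_lower(str_replace_all(link_texts, " ", "_"))

tables[["base_set"]]
```

Note `str_replace_all()` rather than `str_replace()`, so link texts with more than one space are still converted fully.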
The page wasn't scrolling down because, by default, the cursor is in the search bar. So I made a few modifications to your code to make it scroll all the way down.
#Launch browser
rD <- rsDriver(browser="firefox", port=9545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
remDr$findElement(using = "link text", "Promo")$clickElement()
#clicking outside the search bar
remDr$findElement(using = "xpath", value = '//*[@id="console-page"]')$clickElement()
webElem <- remDr$findElement("css", "body")
#looping to get at the end of the page.
for (i in 1:25){
  Sys.sleep(1)
  webElem$sendKeysToElement(list(key = "end"))
}
#extract table
full_table <- remDr$getPageSource()[[1]]
html_page <- read_html(full_table)
html_page %>%
  html_elements("#games_table") %>%
  html_table()
[[1]]
# A tibble: 888 x 5
Card Ungraded `Grade 9` `PSA 10` ``
<chr> <chr> <chr> <chr> <chr>
1 Mew #8 .99 .79 .62 "+ Collection\n In One Click\n ~
2 Mewtwo #3 .28 .91 7.50 "+ Collection\n In One Click\n ~
3 Charizard GX #SM211 .85 .64 .50 "+ Collection\n In One Click\n ~
4 Charizard V #SWSH050 .00 .99 .98 "+ Collection\n In One Click\n ~
5 Pikachu #24 8.31 2.72 ,919.69 "+ Collection\n In One Click\n ~
6 Entei #34 .50 .21 3.63 "+ Collection\n In One Click\n ~
7 Ancient Mew .79 .99 2.50 "+ Collection\n In One Click\n ~
8 Charizard EX #XY121 .16 5.00 7.00 "+ Collection\n In One Click\n ~
9 Mewtwo EX #XY107 .54 .50 .71 "+ Collection\n In One Click\n ~
10 Charizard GX #SM60 .57 3.98 2.00 "+ Collection\n In One Click\n ~
# ... with 878 more rows
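Once the table is in hand, the stray fifth column (the "+ Collection" button text) can be dropped and the price strings converted to numbers. A sketch using made-up rows in the shape of the tibble above (the real prices render as "$" amounts on the site):

```r
library(dplyr)
library(readr)
library(tibble)

## toy rows shaped like the scraped table; values are invented for illustration
promo <- tibble(
  Card       = c("Mew #8", "Ancient Mew"),
  Ungraded   = c("$4.99", "$9.79"),
  `Grade 9`  = c("$12.50", "$20.00"),
  `PSA 10`   = c("$30.00", "$52.50"),
  ` `        = c("+ Collection", "+ Collection")
)

promo_clean <- promo %>%
  select(1:4) %>%                             ## drop the button column
  mutate(across(-Card, parse_number))         ## "$4.99" -> 4.99

promo_clean
```

`readr::parse_number()` strips the currency symbol and grouping commas, so values like "$1,919.69" also come through as numerics.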