rvest: Collecting page numbers from a website
I have the following code to try to scrape a website:
library(httr)   # GET()
library(rvest)  # read_html(), html_nodes(), %>%

url <- "https://www.fotocasa.es/es/alquiler/casas/madrid-capital/todas-las-zonas/l"
x <- GET(url)
x %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="App"]/div[2]/div[1]/main/div/div[4]')
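What I want to do is collect the page numbers at the bottom of the page. Previously the following worked: html_nodes(".sui-PaginationBasic-item a")
However, it no longer does. I tried using an XPath taken from the browser's Inspect Element tool instead.
The expected output is something like:
c(1, 2, 3, 4, 5, ..., 101)
depending on how many pages the given listing has.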
Using RSelenium
We can get the page numbers from the pagination links:
library(stringr)
library(RSelenium)
library(rvest)  # read_html(), html_nodes(), html_attr()
library(dplyr)

# launch the browser
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
url <- "https://www.fotocasa.es/es/alquiler/casas/madrid-capital/todas-las-zonas/l"
remDr$navigate(url)

# accept the cookie banner
remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()

# scroll to the end of the page
webElem <- remDr$findElement("css", "html")
webElem$sendKeysToElement(list(key = "end"))
# press the up arrow to bring the pagination into view
webElem$sendKeysToElement(list(key = "up_arrow"))

# get the page URLs from the pagination links
link <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.re-Pagination') %>%
  html_nodes('a') %>%
  html_attr('href')

# extract only the page numbers from the URLs
str_extract(link, "[[:digit:]]+")
[1] NA "2" "3" "4" "200" "2"
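The result above is a character vector with an NA (a link for which no digits were matched) and a repeated "2", not yet the c(1, 2, 3, ...) sequence asked for in the question. A minimal sketch of one way to get there, assuming the largest number found is the last page and that pages are numbered consecutively from 1 (the names link_numbers, last_page and all_pages are made up for illustration):

```r
# Example input copied from the output above; in practice this would be
# the result of str_extract(link, "[[:digit:]]+").
link_numbers <- c(NA, "2", "3", "4", "200", "2")

# Convert to integers; NA stays NA.
page_numbers <- as.integer(link_numbers)

# Assumption: the largest number in the pagination links is the last page.
last_page <- max(page_numbers, na.rm = TRUE)

# Assumption: pages run consecutively from 1 to last_page.
all_pages <- seq_len(last_page)

head(all_pages)
```

If the pagination contains no numbered links at all, max() with na.rm = TRUE returns -Inf with a warning, so that case needs a check before calling seq_len().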