Loop pages and crawl Excel file paths using rvest
For the entries in this link, I need to click into each entry and then scrape the URL of the Excel file path at the bottom left of the page. How can I do this with an R web-scraping package such as rvest? Thanks in advance.
library(rvest)
# Start by reading an HTML page with read_html():
common_list <- read_html("http://www.csrc.gov.cn/csrc/c100121/common_list.shtml")
common_list %>%
# extract the link (<a>) nodes
rvest::html_nodes("a") %>%
# extract their text
rvest::html_text() -> webtxt
# inspect
head(webtxt)
First, my question is how to set html_nodes correctly so that I get the URL of each page?
Update:
> driver
$client
[1] "No sessionInfo. Client browser is mostly likely not opened."
$server
PROCESS 'file105483d2b3a.bat', running, pid 37512.
> remDr
$remoteServerAddr
[1] "localhost"
$port
[1] 4567
$browserName
[1] "chrome"
$version
[1] ""
$platform
[1] "ANY"
$javascript
[1] TRUE
$nativeEvents
[1] TRUE
$extraCapabilities
list()
When I run remDr$navigate(url), I get:
Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
First, get the links with rvest. Note that remDr$navigate() expects a single URL, which is why passing a character vector of several links fails with length(url) == 1 is not TRUE.
library(rvest)
library(dplyr)
library(RSelenium)
# url is the listing page from the question
url <- "http://www.csrc.gov.cn/csrc/c100121/common_list.shtml"
link <- url %>%
read_html() %>%
html_nodes('.mt10')
# the second .mt10 node holds the entry list; build absolute URLs
link <- link[[2]] %>%
html_nodes("a") %>%
html_attr('href') %>% paste0('http://www.csrc.gov.cn', .)
[1] "http://www.csrc.gov.cn/csrc/c101921/c1758587/content.shtml"
[2] "http://www.csrc.gov.cn/csrc/c101921/c1714636/content.shtml"
[3] "http://www.csrc.gov.cn/csrc/c101921/c1664367/content.shtml"
[4] "http://www.csrc.gov.cn/csrc/c101921/c1657437/content.shtml"
[5] "http://www.csrc.gov.cn/csrc/c101921/c1657426/content.shtml"
We can then use RSelenium to iterate over link and download the Excel files. It took more than a minute for a single page to load fully, so I will demonstrate with a single link.
url <- "http://www.csrc.gov.cn/csrc/c101921/c1758587/content.shtml"
# launch the browser
driver <- rsDriver(browser = c("chrome"))
remDr <- driver[["client"]]
# click on the excel file path
remDr$navigate(url)
remDr$findElement('xpath', '//*[@id="files"]/a')$clickElement()
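To extend this to all entries, the same navigate-and-click step can be wrapped in a loop over the link vector built above. This is a sketch, not a tested run against the site: the XPath is taken from the single-page demo, the wait time is a guess based on the slow page loads, and tryCatch() is added defensively in case a page lacks the file element.

```r
library(RSelenium)

# assumes `link` (character vector of entry URLs) and a running
# `remDr` client from the rsDriver() call above
for (u in link) {
  remDr$navigate(u)
  # pages took over a minute to load; adjust the wait as needed
  Sys.sleep(90)
  el <- tryCatch(
    remDr$findElement('xpath', '//*[@id="files"]/a'),
    error = function(e) NULL
  )
  if (!is.null(el)) el$clickElement()
}

# clean up the browser and the Selenium server when done
remDr$close()
driver$server$stop()
```

Chrome will drop each clicked file into its default download directory; setting `extraCapabilities` on rsDriver() to point `download.default_directory` elsewhere is one option, but the exact capability layout depends on your Chrome/chromedriver versions.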