"Next Page" rvest scrape 的函数
Function for "Next Page" rvest scrape
I've added the final code I used at the bottom, in case anyone has a similar question. I used the answer provided below, but added a couple of extra nodes, system sleep times (to keep from getting kicked off the server), and an if condition to prevent an error after the last valid page is scraped.
I am trying to pull multiple pages from a website using its next-page functionality. I created a data frame with a nexturl variable and filled in the first value with the starting url.
```{r}
# Building data frame with variables
bframe <- data.frame(matrix(ncol = 3, nrow = 10000))
x <- c("curpage", "nexturl", "posttext")
colnames(bframe) <- x

# Assigning first value for nexturl
bframe$nexturl[[1]] <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"
```
I want to pull the text as follows (I know the code is clunky, I'm brand new at this, but it does get me what I want):
```{r}
## Create html object
blogfunc <- read_html("http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/")

## Create object with the post content scraped
posttext <- blogfunc %>%
  html_nodes(".article-content") %>%
  html_text()
posttext <- gsub('[\a]', '', posttext)
posttext <- gsub('[\t]', '', posttext)
posttext <- gsub('[\n]', '', posttext)

## Scrape next url
nexturl <- blogfunc %>%
  html_nodes(".prev-post-link-wrap a") %>%
  html_attr("href")
```
Any advice on turning the above into a function and using it to fill in the data frame? I am struggling to apply the online examples.
Working answer with sleep times and a condition for stopping after the last valid page:
```{r}
library(rvest)

url <- "http://www.ashleyannphotography.com/blog/2008/05/31/the-making-of-a-wet-willy/"
# Select the first page.

getPostContent <- function(url){
  Sys.sleep(2)
  # Pause so the server doesn't take us for a robot.
  read_html(url) %>%
    html_nodes(".article-content") %>%
    html_text() %>%
    gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}
# Pulls the node for the post content.

getDate <- function(url) {
  Sys.sleep(2.6)
  read_html(url) %>%
    html_node(".updated") %>%
    html_text()
}
# Pulls the node for the date.

getTitle <- function(url) {
  Sys.sleep(.8)
  read_html(url) %>%
    html_node(".article-title") %>%
    html_text()
}
# Pulls the node for the title.

getNextUrl <- function(url) {
  Sys.sleep(.2)
  read_html(url) %>%
    html_node(".prev-post-link-wrap a") %>%
    html_attr("href")
}
# Pulls the node for the url of the previous post.

scrapeBackMap <- function(url, n){
  Sys.sleep(3)
  purrr::map_df(1:n, ~{
    if(!is.na(url)){
      # Only run if the url is not NA, i.e. the last valid page has not been passed.
      oUrl  <- url
      date  <- getDate(url)
      post  <- getPostContent(url)
      title <- getTitle(url)
      url  <<- getNextUrl(url)
      # Combine the scraped pieces into one row of the data frame.
      data.frame(curpage  = oUrl,
                 nexturl  = url,
                 posttext = post,
                 pubdate  = date,
                 ptitle   = title)
    }
  })
}

res <- scrapeBackMap(url, 3000)
class(res)
str(res)
# Builds the data frame.
```
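Since the map only adds a row while the url is not NA, asking for 3000 jumps is safe even if the blog has far fewer posts; the scrape simply stops adding rows after the last valid page. As a quick sanity check afterwards (just a sketch using the columns defined above, not part of the original run), one might inspect where it stopped:

```{r}
nrow(res)                # how many posts were actually scraped
tail(res$curpage, 1)     # url of the oldest post reached
any(is.na(res$nexturl))  # TRUE once the scrape has hit the last valid page
```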
My idea is to scrape each post's content, find the 'previous post' url, navigate to that url, and repeat the process.
```{r}
library(rvest)

url <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"
```
Scraping the post's content
```{r}
getPostContent <- function(url){
  read_html(url) %>%
    html_nodes(".article-content") %>%
    html_text() %>%
    gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}
```
Scraping the next url
```{r}
getNextUrl <- function(url) {
  read_html(url) %>%
    html_node(".prev-post-link-wrap a") %>%
    html_attr("href")
}
```
Once we have these 'support' functions, we can glue them together.
Apply the function n times

I suppose a for or while loop could be set up to continue until getNextUrl returns NA, but I preferred to define a number n of jumps back and apply the functions at each 'jump' (a sketch of the while-loop variant appears right after the next code block). Starting from the original url, we retrieve its content, then overwrite url with the newly extracted value and carry on until the loop finishes.
```{r}
scrapeBackApply <- function(url, n) {
  sapply(1:n, function(x) {
    r <- getPostContent(url)
    # Overwrite 'url' in the enclosing function environment
    url <<- getNextUrl(url)
    r
  })
}
```
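For completeness, here is a minimal sketch of the while-loop variant mentioned above: instead of fixing n in advance, it keeps jumping back until getNextUrl() returns NA. The name scrapeBackWhile is made up here for illustration; it reuses the getPostContent() and getNextUrl() helpers defined earlier.

```{r}
# Sketch only: walk back through 'previous post' links until none is left.
scrapeBackWhile <- function(url) {
  pages <- list()
  while (!is.na(url)) {
    pages[[length(pages) + 1]] <- data.frame(curpage  = url,
                                             posttext = getPostContent(url),
                                             nexturl  = getNextUrl(url),
                                             stringsAsFactors = FALSE)
    url <- pages[[length(pages)]]$nexturl
  }
  do.call(rbind, pages)  # same shape as the map_df result below
}
```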
Or we can use the purrr::map family, in particular map_df, to obtain a data.frame directly, like your bframe.
```{r}
scrapeBackMap <- function(url, n) {
  purrr::map_df(1:n, ~{
    oUrl <- url
    post <- getPostContent(url)
    url <<- getNextUrl(url)
    data.frame(curpage  = oUrl,
               nexturl  = url,
               posttext = post)
  })
}
```
Results
```{r}
res <- scrapeBackApply(url, 2)
class(res)
#> [1] "character"
str(res)
#> chr [1:2] "Six years ago this month, my eldest/oldest/elder/older daughter<U+0085>Okay sidenote <U+0096> the #1 grammar correction I receive on a regula"| __truncated__ ...

res <- scrapeBackMap(url, 4)
class(res)
#> [1] "data.frame"
str(res)
#> 'data.frame': 4 obs. of 3 variables:
#> $ curpage : chr "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/" "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/"
#> $ nexturl : chr "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/" "http://www.ashleyannphotography.com/blog/2017/03/24/the-youngest-cousin/"
#> $ posttext: chr "Six years ago this month, my eldest/oldest/elder/older daughter<U+0085>Okay sidenote <U+0096> the #1 grammar correction I receive on a regula"| __truncated__ "Today I am guest posting over on the Bought Beautifully blog about something new my family tried as a way to usher in our Easte"| __truncated__ "A couple of weeks ago, we drove to Illinois to watch one my nieces in a track meet and another niece in her high school musical"| __truncated__ "Often the activities we do as a family tend to cater more towards our older kids than the girls. The girls are always in the mi"| __truncated__
```