"Next Page" rvest scrape 的函数

Function for "Next Page" rvest scrape

I've added the final code I used at the bottom in case anyone runs into a similar problem. I used the answer provided below, but added a couple of nodes, Sys.sleep times (to avoid being kicked off the server), and an if condition to prevent an error after the last valid page has been scraped.

I am trying to pull multiple pages from a website by following its "next page" link. I created a data frame with a nexturl variable and populated the first value with the starting URL:
```{r}
# Build a data frame to hold the scraped values.
bframe <- data.frame(matrix(ncol = 3, nrow = 10000))
x <- c("curpage", "nexturl", "posttext")
colnames(bframe) <- x

# Assign the first value for nexturl.
bframe$nexturl[[1]] <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"
```

I want to pull the text as follows (I know the code is clunky, I'm new to this, but it does get me what I want):

```{r}
library(rvest)

## Create html object.
blogfunc <- read_html("http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/")

## Create object with post content scraped.
posttext <- blogfunc %>%
    html_nodes(".article-content") %>%
    html_text()
posttext <- gsub('[\a]', '', posttext)
posttext <- gsub('[\t]', '', posttext)
posttext <- gsub('[\n]', '', posttext)

## Scrape next url.
nexturl <- blogfunc %>%
    html_nodes(".prev-post-link-wrap a") %>%
    html_attr("href")
```

Any advice on turning the above into a function and using it to fill in the data frame? I am struggling to apply the examples I find online.


Working answer with sleep times and a guard for after the last valid page:

```{r}
library(rvest)

# First page to scrape.
url <- "http://www.ashleyannphotography.com/blog/2008/05/31/the-making-of-a-wet-willy/"

# Pull the node for post content.
getPostContent <- function(url){
    Sys.sleep(2)  # Pause so the server doesn't take us for a robot.
    read_html(url) %>%
        html_nodes(".article-content") %>%
        html_text() %>%
        gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}

# Pull the node for the date.
getDate <- function(url) {
    Sys.sleep(2.6)
    read_html(url) %>%
        html_node(".updated") %>%
        html_text()
}

# Pull the node for the title.
getTitle <- function(url) {
    Sys.sleep(.8)
    read_html(url) %>%
        html_node(".article-title") %>%
        html_text()
}

# Pull the node for the URL of the previous post.
getNextUrl <- function(url) {
    Sys.sleep(.2)
    read_html(url) %>%
        html_node(".prev-post-link-wrap a") %>%
        html_attr("href")
}

# Jump back through n posts, assembling the results into a data frame.
scrapeBackMap <- function(url, n){
    Sys.sleep(3)
    purrr::map_df(1:n, ~{
        # Only run while the URL is not NA, i.e. until the last
        # valid page has been scraped.
        if(!is.na(url)){
            oUrl <- url
            date <- getDate(url)
            post <- getPostContent(url)
            title <- getTitle(url)
            url <<- getNextUrl(url)

            data.frame(curpage = oUrl,
                       nexturl = url,
                       posttext = post,
                       pubdate = date,
                       ptitle = title)
        }
    })
}

# Build and inspect the data frame.
res <- scrapeBackMap(url, 3000)
class(res)
str(res)
```

My idea is to scrape the content of each post, find the 'previous post' URL, navigate to that URL, and repeat the process.

```{r}
library(rvest)

url <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"
```

Scrape the post's content:

```{r}
getPostContent <- function(url){
    read_html(url) %>%
        html_nodes(".article-content") %>%
        html_text() %>%
        gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}
```

Scrape the next URL:

```{r}
getNextUrl <- function(url) {
    read_html(url) %>%
        html_node(".prev-post-link-wrap a") %>%
        html_attr("href")
}
```

Once we have these 'support' functions, we can glue them together.

Apply the functions n times

I suppose a for or while loop could be set up to continue until getNextUrl returns NA, but I preferred to define a number n of jumps back and apply the functions at each 'jump'.
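
For completeness, here is a minimal sketch of that loop-until-NA variant, assuming the getPostContent and getNextUrl helpers defined above (scrapeBackWhile is a hypothetical name, not part of the answer's code):

```{r}
# Sketch: keep jumping back until there is no "previous post" link,
# i.e. until html_attr() returns NA.
scrapeBackWhile <- function(url) {
    posts <- character(0)
    while (!is.na(url)) {
        posts <- c(posts, getPostContent(url))
        url <- getNextUrl(url)  # NA once the last valid page is reached
    }
    posts
}
```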

Starting from the original url, we retrieve its content, then overwrite url with the newly extracted value and continue until the cycle stops.

```{r}
scrapeBackApply <- function(url, n) {
    sapply(1:n, function(x) {
        r <- getPostContent(url)
        # Overwrite 'url' in the enclosing scope with the next one to visit.
        url <<- getNextUrl(url)
        r
    })
}
```

Or we can use the purrr::map family, in particular map_df, to obtain directly a data.frame like your bframe:

```{r}
scrapeBackMap <- function(url, n) {
    purrr::map_df(1:n, ~{
        oUrl <- url
        post <- getPostContent(url)
        url <<- getNextUrl(url)
        data.frame(curpage = oUrl,
                   nexturl = url,
                   posttext = post)
    })
}
```

Results:

```{r}
res <- scrapeBackApply(url, 2)
class(res)
#> [1] "character"
str(res)
#>  chr [1:2] "Six years ago this month, my eldest/oldest/elder/older daughter…Okay sidenote – the #1 grammar correction I receive on a regula"| __truncated__ ...
```

```{r}
res <- scrapeBackMap(url, 4)
class(res)
#> [1] "data.frame"
str(res)
#> 'data.frame':    4 obs. of  3 variables:
#>  $ curpage : chr  "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/" "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/"
#>  $ nexturl : chr  "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/" "http://www.ashleyannphotography.com/blog/2017/03/24/the-youngest-cousin/"
#>  $ posttext: chr  "Six years ago this month, my eldest/oldest/elder/older daughter…Okay sidenote – the #1 grammar correction I receive on a regula"| __truncated__ "Today I am guest posting over on the Bought Beautifully blog about something new my family tried as a way to usher in our Easte"| __truncated__ "A couple of weeks ago, we drove to Illinois to watch one my nieces in a track meet and another niece in her high school musical"| __truncated__ "Often the activities we do as a family tend to cater more towards our older kids than the girls. The girls are always in the mi"| __truncated__
```
#>  $ posttext: chr  "Six years ago this month, my eldest/oldest/elder/older daughter<U+0085>Okay sidenote <U+0096> the #1 grammar correction I receive on a regula"| __truncated__ "Today I am guest posting over on the Bought Beautifully blog about something new my family tried as a way to usher in our Easte"| __truncated__ "A couple of weeks ago, we drove to Illinois to watch one my nieces in a track meet and another niece in her high school musical"| __truncated__ "Often the activities we do as a family tend to cater more towards our older kids than the girls. The girls are always in the mi"| __truncated__