How to scrape all subreddit posts in a given time period
I have a function that scrapes all posts in the bitcoin subreddit between 2014-11-01 and 2015-10-31.
However, I can only extract about 990 posts, going back only to October 25. I don't understand what's happening. After consulting https://github.com/reddit/reddit/wiki/API, I added a 15-second Sys.sleep between extracts, to no avail.
I also tried scraping another subreddit (fitness), but it likewise returned only about 900 posts.
require(jsonlite)
require(dplyr)

getAllPosts <- function() {
  url <- "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100"
  extract <- fromJSON(url)
  posts <- extract$data$children$data %>%
    dplyr::select(name, author, num_comments, created_utc, title, selftext)
  after <- posts[nrow(posts), 1]
  url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=", after, "&limit=100")
  extract.next <- fromJSON(url.next)
  posts.next <- extract.next$data$children$data
  # keep paging as long as the response contains any rows
  while (!is.null(nrow(posts.next))) {
    posts.next <- posts.next %>%
      dplyr::select(name, author, num_comments, created_utc, title, selftext)
    posts <- rbind(posts, posts.next)
    after <- posts[nrow(posts), 1]
    url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=", after, "&limit=100")
    Sys.sleep(15)
    extract <- fromJSON(url.next)
    posts.next <- extract$data$children$data
  }
  posts$created_utc <- as.POSIXct(posts$created_utc, origin = "1970-01-01")
  return(posts)
}

posts <- getAllPosts()
Is there some kind of limit on reddit that I'm hitting?
Yes, all reddit listings (posts, comments, etc.) are capped at 1000 items; for performance reasons they are essentially just cached lists rather than live queries.
To get around this, you need to do some clever searching based on timestamps.
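A minimal sketch of that workaround, building on the question's own code: split the overall date range into windows small enough that each cloudsearch query returns fewer than 1000 posts, then paginate within each window as before. The helper name `fetch_window` and the one-week window size are assumptions, not part of the original; shrink the window if a busy subreddit still hits the cap inside a single window.

```r
library(jsonlite)
library(dplyr)

# Fetch every post in one [start_ts, end_ts] window, paginating with
# `after`; each window stays below reddit's 1000-item listing cap.
fetch_window <- function(start_ts, end_ts) {
  base <- paste0(
    "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A",
    start_ts, "..", end_ts,
    "&sort=new&restrict_sr=on&syntax=cloudsearch&limit=100"
  )
  posts <- NULL
  after <- NULL
  repeat {
    url <- if (is.null(after)) base else paste0(base, "&after=", after)
    page <- fromJSON(url)$data$children$data
    if (is.null(nrow(page)) || nrow(page) == 0) break
    page <- page %>%
      dplyr::select(name, author, num_comments, created_utc, title, selftext)
    posts <- rbind(posts, page)
    after <- posts[nrow(posts), "name"]
    Sys.sleep(2)  # stay under the API rate limit
  }
  posts
}

getAllPosts <- function(start_ts = 1414800000, end_ts = 1446335999,
                        window = 7 * 24 * 3600) {
  # Disjoint windows (s .. s + window - 1), so no deduplication is needed.
  starts <- seq(start_ts, end_ts, by = window)
  all_posts <- do.call(rbind, lapply(starts, function(s) {
    fetch_window(s, min(s + window - 1, end_ts))
  }))
  all_posts$created_utc <- as.POSIXct(all_posts$created_utc,
                                      origin = "1970-01-01", tz = "UTC")
  all_posts
}

posts <- getAllPosts()
```

Because the windows partition the full range end to end, the union of the per-window results covers every post the cap previously hid, at the cost of one query sequence per window.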