Reddit Scraping Error: HTTP status was '403 Forbidden'

Question

I am trying to scrape Reddit using RedditExtractoR.

My query:

df = get_reddit(search_terms = "blockchain", page_threshold = 2)

It fails with the following errors:

cannot open URL 'http://www.reddit.com/r/CryptoCurrency/comments/7vga1y/i_will_tell_you_exactly_what_is_going_on_here/.json?limit=500': HTTP status was '403 Forbidden'
cannot open URL 'http://www.reddit.com/r/IAmA/comments/blssl3/my_name_is_benjamin_zhang_and_im_a_transportation/.json?limit=500': HTTP status was '403 Forbidden'
cannot open URL 'http://www.reddit.com/r/IAmA/comments/blssl3/my_name_is_benjamin_zhang_and_im_a_transportation/.json?limit=500': HTTP status was '403 Forbidden'

How can I fix this?

Answer 1

The common causes of a 403 Forbidden response are: a) a problem on the server side, or b) your client being blocked.

I wrote a short R program that calls get_reddit from the RedditExtractoR library to test whether using it gets you blocked:

library(RedditExtractoR)

blockchain <- get_reddit(
    search_terms = "blockchain",
    page_threshold = 2
)

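It works like a charm every time I run it. Fortunately, RedditExtractoR has built-in rate limiting to prevent this kind of problem. According to the RedditExtractoR package documentation: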

Question: All functions in this library appear to be a little slow, why is that?
Answer: The Reddit API allows users to make 60 requests per minute (1 request per second), which is why URL parsers used in this library intentionally limit requests to conform to the API requirements
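The throttling described in that quote can be illustrated with a minimal sketch. This is not RedditExtractoR's actual implementation; `fetch_throttled`, its `fetch` argument, and `per_minute` are hypothetical names introduced here only to show how a fixed pause between requests keeps a client under a per-minute limit.

```r
# Hypothetical helper (not part of RedditExtractoR): call `fetch` on each URL,
# pausing between calls so that at most `per_minute` requests are started
# per minute.
fetch_throttled <- function(urls, fetch, per_minute = 60) {
  delay <- 60 / per_minute  # seconds to wait between consecutive requests
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)  # no pause needed before the first request
    results[[i]] <- fetch(urls[[i]])
  }
  results
}

# Usage with a stand-in fetcher that does no network I/O.
pages <- fetch_throttled(c("url-a", "url-b"), fetch = toupper, per_minute = 600)
```

With the default `per_minute = 60` the pause is one second, which matches the limit quoted above and explains why the library feels slow on long result lists.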

However, it is still possible to get a 403 error because you have been blocked.

If you start receiving 403 responses:

Wait a few days, to make sure the problem is not on the API side
Use a scraping service or a VPN
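On a much shorter timescale, the "wait and try again" advice can be sketched as a retry wrapper. This is only an illustration: `retry_on_403` is a hypothetical helper, and it assumes the failure reaches your code as an R error or warning whose message contains "403". Whether a given scraping function surfaces the status that way, or only prints it, is something you would need to check.

```r
# Hypothetical helper: run `action()`. If it signals an error or warning whose
# message mentions "403", wait and try again, doubling the wait each time.
# Any other error or warning is re-raised immediately as an error.
retry_on_403 <- function(action, max_tries = 3, wait = 1) {
  # Shared handler: note that a warning also ends the current attempt.
  on_condition <- function(cond) {
    list(ok = FALSE, message = conditionMessage(cond))
  }
  for (attempt in seq_len(max_tries)) {
    outcome <- tryCatch(
      list(ok = TRUE, value = action()),
      error = on_condition,
      warning = on_condition
    )
    if (outcome$ok) return(outcome$value)
    is_403 <- grepl("403", outcome$message, fixed = TRUE)
    if (!is_403 || attempt == max_tries) stop(outcome$message, call. = FALSE)
    Sys.sleep(wait)   # back off before the next attempt
    wait <- wait * 2
  }
}

# Usage with a stand-in action that fails twice with a 403 and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP status was '403 Forbidden'")
  "ok"
}
result <- retry_on_403(flaky, max_tries = 5, wait = 0.01)
```

If the 403 persists after every retry, the block is unlikely to be a transient server problem, and the two options listed above are the ones left to try.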

