PushshiftAPI not returning all comments

Question

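I am using the following code to get the comments for a given Reddit post. We only want top/first-level comments, but that filter is not implemented yet because we can't get even this basic code to return what we expect: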

import pandas as pd
import datetime as dt
from pmaw import PushshiftAPI

comments = pd.DataFrame()
api = PushshiftAPI()
subreddit = "Conservative"
limit = 100000

# ids are loaded from another df in original code, but list of 3 here for simplicity
ids = ['ly98ob', 'lxku9i', 'lxzjv5']

# main loop
for id in ids:
    # get comments for this post using the link_id parameter
    new_comments = api.search_comments(subreddit=subreddit, link_id=id)
    # TROUBLE IS HERE^^-----------------------^^ returns only ~26 comments
    new_comments = pd.DataFrame(new_comments)

    # add new comments to the comments dataframe
    comments = pd.concat([comments, new_comments], sort=False, ignore_index=True)

# some additional prints and save to csv is also in the code

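I looked at the solution in this Reddit Pushshift post, but even the direct API call https://api.pushshift.io/reddit/comment/search/?link_id=ly98ob does not return more than 25 comments.

I expected api.search_comments(...) to return far more comments than the ~26 we currently get. Am I missing anything (obvious) or making a mistake in my code for fetching all comments for a given post id?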

Answer 1

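For some reason, the search_comments and search_submission_comment_ids methods cannot return any comments after November 26, 2021. Until this is fixed, here is a quick workaround I implemented for my own use. It mixes PMAW (to get submissions by date) and PRAW (to get the comments for those submissions):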

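import datetime as dt
import pandas as pd
import praw
from pmaw import PushshiftAPI

# Assumed setup (not shown in the original answer): use your own Reddit app credentials
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET", user_agent="YOUR_USER_AGENT")
api_praw = PushshiftAPI(praw=reddit)

# Assumed example values: subreddit and a date window as epoch seconds
subreddit = "Conservative"
after = int(dt.datetime(2021, 3, 1).timestamp())
before = int(dt.datetime(2021, 3, 8).timestamp())

# PMAW: fetch the submissions in the date window
submissions = api_praw.search_submissions(subreddit=subreddit, before=before, after=after, limit=10)
sub_list = list(submissions)

# columns of interest: [['subreddit', 'title', 'selftext', 'author', 'score', 'created_utc', 'id', 'num_comments', 'permalink', 'upvote_ratio']]
sub_df = pd.DataFrame(sub_list)
sub_df['permalink'] = 'www.reddit.com' + sub_df['permalink']
sub_ids = list(sub_df['id'])

comment_list = []
for sub_id in sub_ids:
    # PRAW: load the submission and expand every "load more comments" stub
    submission = reddit.submission(id=sub_id)
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        comment_list.append(comment.__dict__)

comments_df = pd.DataFrame(comment_list)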
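
Since the question only wants top/first-level comments, here is a minimal sketch of that filter. It relies on Reddit's fullname prefixes: a comment whose parent_id starts with "t3_" hangs directly off the submission, while "t1_" means it replies to another comment. The helper name top_level_only and the sample records are hypothetical, made up for illustration. It also assumes each comment is a dict with a parent_id key, as with the records collected above.

```python
def top_level_only(comments):
    """Keep comments whose parent is the submission itself (parent_id starts with 't3_')."""
    return [c for c in comments if str(c.get("parent_id", "")).startswith("t3_")]


# Hypothetical sample records, shaped like the comment dicts collected above
sample = [
    {"id": "c1", "parent_id": "t3_ly98ob", "body": "top-level comment"},
    {"id": "c2", "parent_id": "t1_c1", "body": "reply to c1"},
    {"id": "c3", "parent_id": "t3_ly98ob", "body": "another top-level comment"},
]

top_level = top_level_only(sample)
print([c["id"] for c in top_level])
```

With the workaround above, this would be applied to comment_list before building comments_df. If you stay in PRAW, iterating over submission.comments directly (instead of submission.comments.list()) should also yield only the top-level comments.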

python

reddit

web-scraping