PushshiftAPI not returning all comments
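I am using the following code to fetch the comments for a given Reddit post. We only want top-level (first-level) comments, but that filter is not implemented yet, because we cannot get even this basic code to return what we expect: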
import pandas as pd
import datetime as dt
from pmaw import PushshiftAPI

comments = pd.DataFrame()
api = PushshiftAPI()
subreddit = "Conservative"
limit = 100000
# ids are loaded from another df in original code, but list of 3 here for simplicity
ids = ['ly98ob', 'lxku9i', 'lxzjv5']

# main loop
for id in ids:
    # get comments for this post using the link_id parameter
    new_comments = api.search_comments(subreddit=subreddit, link_id=id)
    # TROUBLE IS HERE^^-----------------------^^ returns only ~26 comments
    new_comments = pd.DataFrame(new_comments)
    # add new comments to comments dataframe
    comments = pd.concat([comments, new_comments], sort=False, ignore_index=True)
# some additional prints and save to csv is also in the code
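I looked at this Reddit Pushshift post, but even the API call from its solution, https://api.pushshift.io/reddit/comment/search/?link_id=ly98ob, does not return more than 25 comments.
I would expect api.search_comments(...) to return far more comments than the ~26 we get now. Am I missing anything (obvious), or is there a mistake in my code for fetching all comments of a given post id?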
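To put a number on the gap described above, it can help to compare how many comments were retrieved per post with how many each post is expected to have. The following is only a rough sketch: the helper names (`bare_id`, `shortfall`), the sample data, and the expected totals are made up for illustration, and it assumes each comment carries a `link_id` of the form `t3_<post id>`. The expected totals could come from a submission's `num_comments` field, which may not match the number of retrievable comments exactly.

```python
from collections import Counter


def bare_id(link_id):
    """Strip the 't3_' submission prefix, if present, from a link_id."""
    link_id = str(link_id)
    return link_id[3:] if link_id.startswith("t3_") else link_id


def shortfall(comments, expected_counts):
    """Return {post_id: (retrieved, expected)} for posts that came up short.

    comments: iterable of dicts that each have a 'link_id' key.
    expected_counts: dict mapping bare post id -> expected comment count.
    """
    retrieved = Counter(bare_id(c["link_id"]) for c in comments)
    return {
        post_id: (retrieved[post_id], expected)
        for post_id, expected in expected_counts.items()
        if retrieved[post_id] < expected
    }


# Made-up sample: 26 comments for one post, 3 for another, none for the third.
sample_comments = [{"link_id": "t3_ly98ob"}] * 26 + [{"link_id": "t3_lxku9i"}] * 3
expected = {"ly98ob": 400, "lxku9i": 3, "lxzjv5": 10}  # illustrative totals only

print(shortfall(sample_comments, expected))
```

Posts that appear in the result are the ones worth re-fetching through another route.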
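For some reason, the search_comments and search_submission_comment_ids methods fail to return any comments after November 26, 2021. Until that is resolved, here is a quick workaround I implemented for my own use. It mixes PMAW (to get the submissions by date) and PRAW (to get the comments of those submissions):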
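# Assumed setup (not shown in the original post): a PRAW client, a PMAW client
# that uses it, and before/after given as epoch timestamps.
# reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
# api_praw = PushshiftAPI(praw=reddit)
submissions = api_praw.search_submissions(subreddit=subreddit, before=before, after=after, limit=10)
sub_list = [sub for sub in submissions]
try:
    # columns of interest:
    # [['subreddit', 'title', 'selftext', 'author', 'score', 'created_utc', 'id', 'num_comments', 'permalink', 'upvote_ratio']]
    sub_df = pd.DataFrame(sub_list)
    sub_df['permalink'] = 'www.reddit.com' + sub_df['permalink']
    sub_ids = list(sub_df['id'])
    comment_list = []
    for sub_id in sub_ids:
        submission = reddit.submission(sub_id)
        # expand every "load more comments" stub so the whole tree is fetched
        submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():
            comment_list.append(comment.__dict__)
    comments_df = pd.DataFrame(comment_list)
except KeyError:
    # The original post omits the except clause; this handler is an assumption.
    # An empty result has no 'permalink' or 'id' column.
    comments_df = pd.DataFrame()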
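
The question also asks for top-level comments only. Once the comments are collected, that filter can be sketched as below. This is a minimal example, assuming each comment is a dict with a `parent_id` string; Reddit prefixes submission fullnames with `t3_` and comment fullnames with `t1_`, so a top-level comment is one whose parent is a submission. The function name and the sample data are hypothetical.

```python
def top_level_comments(comments):
    """Return only the comments whose parent is the submission itself."""
    return [
        c for c in comments
        if str(c.get("parent_id", "")).startswith("t3_")
    ]


# Hypothetical sample shaped like the collected comment dicts.
sample = [
    {"id": "a1", "link_id": "t3_ly98ob", "parent_id": "t3_ly98ob"},  # top-level
    {"id": "a2", "link_id": "t3_ly98ob", "parent_id": "t1_a1"},      # reply to a1
    {"id": "a3", "link_id": "t3_ly98ob", "parent_id": "t3_ly98ob"},  # top-level
]

top = top_level_comments(sample)
print([c["id"] for c in top])
```

With a DataFrame, the same idea would be a boolean mask on the `parent_id` column using `str.startswith("t3_")`.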