Fetching reddit data using praw into JSON Lines

I'm trying to fetch reddit post data using praw and convert it into a JSON Lines file like this:


{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?"], "response": ["My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "response": ["No, it's still in the game. Use the debug stick to set all sides to `none`"], "id": "gabsj3"}

So the context contains ["POST TITLE", "FIRST LEVEL COMMENT", "SECOND LEVEL COMMENT", "ETC..."], and the response contains the last-level comment. For this post on reddit, it should be:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in", "No, it's still in the game. Use the debug stick to set all sides to `none`"], "response": ["Huh, alright"], "id": "gabsj3"}
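To make the target format concrete, here is a minimal sketch of how one chain of comments maps onto one record. It uses only the standard library `json` module; `make_record` is a hypothetical helper name introduced for illustration and is not part of praw or jsonlines.

```python
import json


def make_record(title, comment_chain, post_id):
    """Build one record: the last comment is the response, the rest is context."""
    *earlier, last = comment_chain
    return {"context": [title, *earlier], "response": [last], "id": post_id}


record = make_record(
    "Cross your redstone wires - Snapshot 20w18a is out",
    [
        "But how will people get a blood spot effect now if the redstone default is a cross again?",
        "Debug Stick?",
    ],
    "gabsj3",
)

# json.dumps escapes newlines inside strings, so each record stays on one line.
line = json.dumps(record)
print(line)
```

Calling `make_record` once for every prefix of a reply chain would give one line per comment, which is the layout of the sample lines above.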


But my code currently produces this instead:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?", "I think we can still use resource packs to change it back into a dot, I don't know so don't quote me on that", "I honestly think the cross redstone looks a bit more like a splatter."], "id": "gabsj3"}


import praw
import jsonlines

reddit = praw.Reddit(client_id='-', client_secret='-', user_agent='user_agent')

max = 1000
sequence =1
for post in reddit.subreddit('minecraft').new(limit=max):
    data = []
    title = []
    comment = []
    response = []
    post_id = post.id
    titl = post.title
    # print("https://www.reddit.com/"+post.permalink)
    title.append(titl)
    print("Fetched "+str(sequence) + " posts .. ")
    submission = reddit.submission(id=post_id)
    sequence = sequence + 1

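    for top_level_comment in submission.comments:
        try:
            cmnt_body = top_level_comment.body
            comment.append(cmnt_body)
            for second_level_comment in top_level_comment.replies:
                response.append(second_level_comment.body)
            context = [title[0],comment[0]]
            data.append({"context": context, "response": response, "id": post_id})
            response = []
            # print(data[0])
            with jsonlines.open('2020-04-30_12.jsonl', mode='a') as writer:
                writer.write(data.pop())
        except Exception :
            continue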


To do this, you need to maintain a stack that holds the current context and use recursion to visit each comment's children:

import jsonlines
import praw

reddit = praw.Reddit(...)  # fill in with your authentication

def main():
    for post in reddit.subreddit("minecraft").new(limit=1000):
        dump_replies(replies=post.comments, context=[post.title])

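def dump_replies(replies, context):
    for reply in replies:
        # Skip "load more comments" placeholders, which have no body.
        if isinstance(reply, praw.models.MoreComments):
            continue

        reply_data = {
            "context": context,
            "response": reply.body,
            "id": reply.submission.id,
        }
        with jsonlines.open("2020-04-30_12.jsonl", mode="a") as writer:
            writer.write(reply_data)

        # Push this comment onto the context stack while visiting its replies.
        context.append(reply.body)
        dump_replies(reply.replies, context)
        context.pop()

main()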


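Before each recursive call, we append the current item's body to the context list, and we remove it again once the recursion returns. This builds a stack that represents the path to the current comment. Then, for each comment, we dump its context, its body, and the submission ID.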

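Note that this does not dump anything for posts that have no comments, which seems consistent with the policy in your example data (since each line represents a comment that is a reply to something else).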
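The append, recurse, pop pattern can be tried without reddit credentials on an in-memory tree. The sketch below is an illustration only: the nested dicts with `body` and `replies` keys stand in for praw comment objects, and `walk` and `thread_to_records` are hypothetical names. Unlike the code above, it collects records in a list instead of writing them immediately, so it stores a copy of the context (`list(context)`); otherwise every record would share the same list, which is mutated later. It also wraps the response in a list to match the sample data.

```python
import json


def walk(replies, context, out):
    """Collect one record per comment; `context` is the path leading to it."""
    for reply in replies:
        out.append({"context": list(context), "response": [reply["body"]]})
        context.append(reply["body"])  # push before descending
        walk(reply["replies"], context, out)
        context.pop()  # pop once the subtree is done


def thread_to_records(title, comments):
    out = []
    walk(comments, [title], out)
    return out


# Stand-in for a comment tree: two top-level comments, one with nested replies.
thread = [
    {"body": "A", "replies": [
        {"body": "A1", "replies": []},
        {"body": "A2", "replies": [{"body": "A2a", "replies": []}]},
    ]},
    {"body": "B", "replies": []},
]

for record in thread_to_records("Title", thread):
    print(json.dumps(record))
```

A thread with no comments yields an empty list here, which mirrors the note above about posts without comments.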