Fetching reddit data using praw into JSON Lines

I am trying to fetch reddit post data using praw and convert it into a JSON Lines file.

What I need is something like this:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?"], "response": ["My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "id": "gabsj3"}
{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in"], "response": ["No, it's still in the game. Use the debug stick to set all sides to `none`"], "id": "gabsj3"}

So the context contains ["POST TITLE", "FIRST LEVEL COMMENT", "SECOND LEVEL COMMENT", "ETC..."], and the response contains the last-level comment. For this post on reddit, the next line should be:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?", "Debug Stick?", "My guess is the dot is flat out gone\n\nThere's no way for it to exist so why would they leave it in", "No, it's still in the game. Use the debug stick to set all sides to `none`"], "response": ["Huh, alright"], "id": "gabsj3"}
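Put differently, for a single reply chain the lines I'm after could be produced by something like the sketch below. `chain_to_lines` and its arguments are hypothetical names made up for illustration, the sample strings are placeholders, and the sketch ignores branching replies entirely:

```python
import json


def chain_to_lines(title, chain, post_id):
    """Turn one reply chain into JSON Lines strings, one per comment."""
    lines = []
    context = [title]  # the context always starts with the post title
    for body in chain:
        # Snapshot the context so far; the current comment is the response.
        record = {"context": list(context), "response": [body], "id": post_id}
        lines.append(json.dumps(record))
        # The current comment becomes context for the next, deeper one.
        context.append(body)
    return lines


lines = chain_to_lines(
    "Post title", ["first level", "second level", "third level"], "abc123"
)
print("\n".join(lines))
```

Each deeper comment repeats everything above it in its context, which is the growth pattern shown in the sample lines.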

But the output of my code looks like this:

{"context": ["Cross your redstone wires - Snapshot 20w18a is out", "But how will people get a blood spot effect now if the redstone default is a cross again?"], "response": ["Debug Stick?", "I think we can still use resource packs to change it back into a dot, I don't know so don't quote me on that", "I honestly think the cross redstone looks a bit more like a splatter."], "id": "gabsj3"}

Here is my code:

import praw
import jsonlines

reddit = praw.Reddit(client_id='-', client_secret='-', user_agent='user_agent')

max = 1000
sequence = 1
for post in reddit.subreddit('minecraft').new(limit=max):
    data = []
    title = []
    comment = []
    response = []
    post_id = post.id
    titl = post.title
    # print("https://www.reddit.com/"+post.permalink)

    print("Fetched "+str(sequence) + " posts .. ")
    title.append(titl)
    try:
        submission = reddit.submission(id=post_id)
        submission.comments.replace_more(limit=None)
        sequence = sequence + 1

        for top_level_comment in submission.comments:
            cmnt_body = top_level_comment.body
            comment.append(cmnt_body)
            for second_level_comment in top_level_comment.replies:
                response.append(second_level_comment.body)
            context = [title[0],comment[0]]
            data.append({"context":context,"response":response,"id":post_id})
            response = []
            # print(data[0])
            with jsonlines.open('2020-04-30_12.jsonl', mode='a') as writer:
                writer.write(data.pop())
            comment.pop()
        title.pop()

    except Exception:
        pass

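This is an interesting way to store the data. I can't say I would use this approach myself, since it involves copying the same information over and over.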

To do this, you need to maintain a stack that holds the current context, and use recursion to visit each comment's children:

import jsonlines
import praw

reddit = praw.Reddit(...)  # fill in with your authentication


def main():
    for post in reddit.subreddit("minecraft").new(limit=1000):
        dump_replies(replies=post.comments, context=[post.title])


def dump_replies(replies, context):
    for reply in replies:
        if isinstance(reply, praw.models.MoreComments):
            continue

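        # One output line per comment: the path so far, this comment, the post ID.
        reply_data = {
            "context": context,
            "response": [reply.body],  # wrapped in a list to match the sample lines
            "id": reply.submission.id,
        }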
        with jsonlines.open("2020-04-30_12.jsonl", mode="a") as writer:
            writer.write(reply_data)

        context.append(reply.body)
        dump_replies(reply.replies, context)
        context.pop()


main()

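Before each recursive call, we append the current item's body to the context list, and we remove it again once the call returns. This builds up a stack that represents the path to the current comment. Then, for each comment, we dump its context, its body, and the submission ID.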
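The push and pop pattern can be tried without touching the reddit API. The following is only a sketch on made-up data: `Comment` is a hypothetical stand-in for a praw comment (just a body and nested replies), `iter_records` is an illustrative name, and the response is wrapped in a list as in the question's sample lines.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Comment:
    """Hypothetical stand-in for a praw comment: a body plus nested replies."""
    body: str
    replies: List["Comment"] = field(default_factory=list)


def iter_records(replies: List[Comment], context: List[str], post_id: str) -> Iterator[dict]:
    """Yield one record per comment, depth-first, using `context` as a stack."""
    for reply in replies:
        # Copy the stack so later pushes and pops do not alter this record.
        yield {"context": list(context), "response": [reply.body], "id": post_id}
        context.append(reply.body)  # push: this comment joins the current path
        yield from iter_records(reply.replies, context, post_id)
        context.pop()  # pop: back to the parent's path


# Made-up thread: A has replies A1 (which has reply A1a) and A2; B has none.
thread = [
    Comment("A", [Comment("A1", [Comment("A1a")]), Comment("A2")]),
    Comment("B"),
]
records = list(iter_records(thread, ["Title"], "abc123"))
for record in records:
    print(record["context"], "->", record["response"])
```

Because the stack is popped after each subtree, sibling comments such as A1 and A2 get the same context rather than being merged into a single response list.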

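Note that this will not dump anything for posts that have no comments, which seems to match the approach in your example data (since every line represents a comment that is a reply to something else).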