Trying to flatten a Reddit JSON into many "conversations"
I'm trying to use the comments of a Reddit thread as the training set for a machine-learning program. An example input is https://old.reddit.com/r/bayarea/comments/cxxl9y/billionaires_yacht_docked_in_embarcadero.json.
I'm filtering for body, id, and parent_id, and I want to turn the nested JSON into many conversations.
For example, if the input is ["A", ["B", ["C", "D"]]], the output should be ["A", "B", "C"] and ["A", "B", "D"].
Here is my current code:
import requests

# fltr: my own helper (not shown) that strips the response down to the listed keys
json_url = "https://old.reddit.com/r/bayarea/comments/cxxl9y/billionaires_yacht_docked_in_embarcadero.json"
r = requests.get(json_url, headers={"user-agent": "PostmanRuntime/7.15.2"})
comments_tree_raw = fltr(r.json(), ["ups", "body", "id", "parent_id"])[1]["data"]

def remove_all_after(node, index):
    # cut the accumulated path back so a sibling branch can reuse it
    target = node.index(index)
    return node[:target]

training_threads = []

# input the children list
def flatten(output, children):
    global training_threads
    for child in children:
        try:
            child_obj = child["data"] if "body" in child["data"] else child
            child_comment = {
                "body": child_obj["body"],
                "id": child_obj["id"],
                "parent": child_obj["parent_id"]
            }
            output.append(child_comment)
        except KeyError:
            continue
        if "replies" not in child["data"]:
            # leaf comment: record the accumulated thread, then rewind to its parent
            training_threads.append(output.copy())
            parent_id = child_comment["parent"].split("_")[1]
            for i in output:
                if i["id"] == parent_id:
                    output = remove_all_after(output, i)
                    break
            continue
        flatten(output, child["data"]["replies"]["data"]["children"])

comments_tree_raw = flatten([], comments_tree_raw["children"])
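For reference, each element of comments_tree_raw["children"] that flatten visits looks roughly like this (my understanding of the old.reddit.com .json shape, simplified, with unrelated keys omitted):

# simplified sketch of one comment node; real responses contain many more keys,
# and leaf comments have "replies": "" rather than a nested Listing
example_child = {
    "kind": "t1",
    "data": {
        "body": "comment text",
        "id": "abc123",
        "parent_id": "t3_cxxl9y",
        "replies": {"kind": "Listing", "data": {"children": ["..."]}},
    },
}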
The flatten function above is my attempt to solve this recursively, but it doesn't produce the output I need. This is what I get: https://pastebin.com/GkpwGUtK.
Any help is greatly appreciated! Thanks!
You can use simple recursion with a generator:
data = ["A", ["B", ["C", "D"]]]

def group(d, c=[]):
    a, b = d
    if all(not isinstance(i, list) for i in b):
        yield from [c + [a, i] for i in b]
    else:
        yield from group(b, c + [a])

print(list(group(data)))
Output:
[['A', 'B', 'C'], ['A', 'B', 'D']]
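The same recursion follows arbitrarily deep nesting; for instance, with one extra level (my own example):

print(list(group(["A", ["B", ["C", ["D", "E"]]]])))
# [['A', 'B', 'C', 'D'], ['A', 'B', 'C', 'E']]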
Edit: a more complete version using itertools.groupby, which also handles several sibling branches at the same level:
from itertools import groupby

def group(d, c=[]):
    _d = [list(b) for _, b in groupby(d, key=lambda x: isinstance(x, list))]
    if len(_d) == 1:
        for i in _d[0]:
            if not isinstance(i, list):
                yield c + [i]
            else:
                yield from group(i, c)
    else:
        for i in range(0, len(_d), 2):
            for k in _d[i]:
                yield from group(_d[i + 1], c + [k])

print(list(group([["C", ["D", "E"], ["C", ["D", "E"], ["C", ["D", "E"]]]]])))
Output:
[['C', 'D'], ['C', 'E'], ['C', 'C', 'D'], ['C', 'C', 'E'], ['C', 'C', 'C', 'D'], ['C', 'C', 'C', 'E']]
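To connect this back to the question's data: the same root-to-leaf accumulation can be applied to the Reddit listing directly. A minimal sketch, assuming the usual old.reddit.com .json shape (leaf comments carry "replies": "", and "more" placeholder nodes have no "body"), untested against the live endpoint:

def reddit_paths(children, prefix=None):
    # yield one list of comment bodies per root-to-leaf path in the comment tree
    prefix = prefix or []
    for child in children:
        data = child.get("data", {})
        if "body" not in data:               # skip "more" placeholder nodes
            continue
        path = prefix + [data["body"]]
        replies = data.get("replies") or {}  # leaf comments have "" here
        sub = replies.get("data", {}).get("children", [])
        if sub:
            yield from reddit_paths(sub, path)
        else:
            yield path

# usage, reusing `r` from the question:
# conversations = list(reddit_paths(r.json()[1]["data"]["children"]))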