How to get threads/conversations from reply IDs (Python)?
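I'm relatively new to Python, and I'm trying to reconstruct conversations/threads from a dataframe that contains lists of IDs.
I currently have a pandas dataframe of tweets/reddit posts, formatted roughly like this: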
id | text | parent_id | replies |
---|---|---|---|
id1 | blah blah | _ post _ | id2, id3, id4, id5, id6, id7 |
id2 | blah blah | id1 | id4, id5, id6, id7 |
id3 | blah blah | id1 | |
id4 | blah blah | id2 | id6, id7 |
id5 | blah blah | id2 | |
id6 | blah blah | id4 | id7 |
id7 | blah blah | id6 | |
My goal is to split the data into threads/conversations based on the IDs. That is, from the example above, I want the following sequences as output:
[id1, id2, id4, id6],
[id1, id2, id4, id7],
[id1, id2, id5], &
[id1, id3].
Having these lists would let me view each thread in full. My current code is very convoluted and looks like this:
out_list = []
for i, row in df.iterrows():
    id_ = row["id"]
    # create our output file
    sequence = [id_]
    replies = list(row['replies'])
    # creates a new dataframe from the replies to the topline comment in question
    reply_df= df.loc[df['id'].isin(replies)]
    reply_df = reply_df[reply_df.Parent_id2 == id_]
    #check if ends at topline
    if reply_df.empty == False:
        def turn_recursion(df, reply_df):
            for j, row_ in reply_df.iterrows():
                replies_2 = reply_df.loc[j, 'replies']
                id_2 = row_["id"]
                reply_df2 = df.loc[df['id'].isin(replies_2)]
                reply_df2 = reply_df2[reply_df2.Parent_id2 == id_2]
                nonlocal sequence
                nonlocal out_list
                if reply_df2.empty == False:
                    sequence.append(id_2)
                    return(turn_recursion(df, reply_df2))
                else:
                    sequence.append(id_2)
                    out_list.append(sequence)
        turn_recursion(test2, reply_df)
    else:
        out_list.append(sequence)
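This currently gives me semi-accurate results, but instead of [[id1, id2, id4, id6], [id1, id2, id4, id7]] I get [id1, id2, id4, id6, id7].
I realise I'm probably just being a bit sleepy and there is a simple solution, but for the life of me I can't find a way to do this that works properly for arbitrary thread lengths.
Thanks in advance for any suggestions. :)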
Use networkx to achieve what you want:
import pandas as pd
import networkx as nx
from collections import defaultdict

data = defaultdict(list)

# Build graph from pandas
G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                            create_using=nx.DiGraph)

# Find leaves (id3, id5, id7)
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Enumerate all possible paths
for node in df['id']:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, node, leaf):
            data[node].append(path)
Output:
>>> data
defaultdict(list,
            {'id1': [['id1', 'id3'],
              ['id1', 'id2', 'id5'],
              ['id1', 'id2', 'id4', 'id6', 'id7']],
             'id2': [['id2', 'id5'], ['id2', 'id4', 'id6', 'id7']],
             'id4': [['id4', 'id6', 'id7']],
             'id6': [['id6', 'id7']]})
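For readers who would rather not add a dependency, the same kind of enumeration can be sketched in plain Python. This is only a minimal sketch and not the method used above: `rows`, `children`, and `threads_from` are hypothetical names, and the sample rows are copied by hand from the question's table (with pandas they could presumably be built with `zip(df['id'], df['parent_id'])`). Each recursive call extends its own copy of the path, so sibling branches never share one list, which appears to be what goes wrong with the shared `sequence` in the question's code.

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the question's table: (id, parent_id).
rows = [
    ("id1", "_ post _"),
    ("id2", "id1"),
    ("id3", "id1"),
    ("id4", "id2"),
    ("id5", "id2"),
    ("id6", "id4"),
    ("id7", "id6"),
]

# Map each parent to its direct replies, preserving row order.
children = defaultdict(list)
for node_id, parent_id in rows:
    children[parent_id].append(node_id)


def threads_from(node_id, path=()):
    """Yield every path from node_id down to a leaf, each as a new list."""
    path = path + (node_id,)  # builds a new tuple, so branches stay independent
    replies = children.get(node_id, [])
    if not replies:
        # No replies: this node ends a thread.
        yield list(path)
        return
    for reply_id in replies:
        yield from threads_from(reply_id, path)


# All threads that start at the top-level post id1.
threads = list(threads_from("id1"))
print(threads)
```

Because id7's parent_id is id6 in the sample table, id6 and id7 end up on the same branch here, just as in the networkx output.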
If you want to merge the dictionary into your dataframe:
df['replies'] = df['id'].map(data)
print(df)
# Output:
    id       text  parent_id                                            replies
0  id1  blah blah   _ post _  [[id1, id3], [id1, id2, id5], [id1, id2, id4, ...
1  id2  blah blah        id1                 [[id2, id5], [id2, id4, id6, id7]]
2  id3  blah blah        id1                                                 []
3  id4  blah blah        id2                                  [[id4, id6, id7]]
4  id5  blah blah        id2                                                 []
5  id6  blah blah        id4                                       [[id6, id7]]
6  id7  blah blah        id6                                                 []
Now you can explode your dataframe:
df = df.explode('replies')
print(df)
# Output:
    id       text  parent_id                    replies
0  id1  blah blah   _ post _                 [id1, id3]
0  id1  blah blah   _ post _            [id1, id2, id5]
0  id1  blah blah   _ post _  [id1, id2, id4, id6, id7]
1  id2  blah blah        id1                 [id2, id5]
1  id2  blah blah        id1       [id2, id4, id6, id7]
2  id3  blah blah        id1                        NaN
3  id4  blah blah        id2            [id4, id6, id7]
4  id5  blah blah        id2                        NaN
5  id6  blah blah        id4                 [id6, id7]
6  id7  blah blah        id6                        NaN
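If only the complete threads from the top-level post are needed, rather than paths starting at every intermediate reply, another option is to walk upward from each leaf through parent_id. The following is a sketch under assumptions, not part of the approach above: `parent_of`, `thread_for`, and `full_threads` are hypothetical names, the mapping is copied by hand from the question's table (something like `dict(zip(df['id'], df['parent_id']))` would presumably be the pandas equivalent), and the data is assumed to contain no cycles.

```python
# Hypothetical id -> parent_id mapping mirroring the question's table.
parent_of = {
    "id1": "_ post _",
    "id2": "id1",
    "id3": "id1",
    "id4": "id2",
    "id5": "id2",
    "id6": "id4",
    "id7": "id6",
}

# A leaf is any id that never appears as another row's parent.
parents = set(parent_of.values())
leaves = [node_id for node_id in parent_of if node_id not in parents]


def thread_for(leaf_id):
    """Walk parent links from leaf_id up to the top-level post.

    Assumes the parent links contain no cycles; a cycle would loop forever.
    """
    path = [leaf_id]
    # Stop once the parent is not itself a known id (e.g. the "_ post _" marker).
    while parent_of.get(path[-1]) in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]  # reverse so the thread reads top-down


# One full thread per leaf, in the order the leaves appear in the mapping.
full_threads = [thread_for(leaf_id) for leaf_id in leaves]
print(full_threads)
```

Since this is a loop rather than a recursive function, very long threads should not run into Python's recursion limit.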