How to get threads/conversations from reply ids (Python)?

I'm relatively new to Python, and I'm trying to reconstruct conversations/threads from a dataframe that contains lists of IDs.

I currently have a pandas dataframe of tweets/reddit posts, formatted roughly like this:

id   text       parent_id  replies
id1  blah blah  _ post _   id2, id3, id4, id5, id6, id7
id2  blah blah  id1        id4, id5, id6, id7
id3  blah blah  id1
id4  blah blah  id2        id6, id7
id5  blah blah  id2
id6  blah blah  id4        id7
id7  blah blah  id6

My goal is to split the data into threads/conversations based on the IDs. For the example above, that means getting the following sequences as output:

[id1, id2, id4, id6],

[id1, id2, id4, id7],

[id1, id2, id5], &

[id1, id3].

Having these lists would let me view each thread in full. My current code is very convoluted and looks like this:

out_list = []
for i, row in df.iterrows():
    id_ = row["id"]
    # create our output file 
    sequence = [id_]
    replies = list(row['replies'])
    # creates a new dataframe from the replies to the topline comment in question
    reply_df= df.loc[df['id'].isin(replies)]
    reply_df = reply_df[reply_df.Parent_id2 == id_]
    #check if ends at topline
    if reply_df.empty == False:
        
        def turn_recursion(df, reply_df):
            for j, row_ in reply_df.iterrows():
                replies_2 = reply_df.loc[j, 'replies']
                id_2 = row_["id"]

                reply_df2 =  df.loc[df['id'].isin(replies_2)]
                reply_df2 = reply_df2[reply_df2.Parent_id2 == id_2]

                nonlocal sequence
                nonlocal out_list
                            
                if reply_df2.empty == False:
                    sequence.append(id_2)
                    return(turn_recursion(df, reply_df2))
                
                else:
                    sequence.append(id_2)
                    out_list.append(sequence)
        
        turn_recursion(test2, reply_df)
    else:
        out_list.append(sequence)
    

This currently gives me semi-accurate results, but instead of [[id1, id2, id4, id6], [id1, id2, id4, id7]] I get [id1, id2, id4, id6, id7].

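I realize I may just be a bit tired and that there is a simple solution, but for the life of me I can't find a way to make this work correctly for threads of arbitrary length.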

Thanks in advance for any suggestions. :)

You can use networkx to get what you want:

import pandas as pd
import networkx as nx
from collections import defaultdict

data = defaultdict(list)

# Build graph from pandas
G = nx.from_pandas_edgelist(df, source='parent_id', target='id', 
                            create_using=nx.DiGraph)

# Find leaves (id3, id5, id7)
leaves = [node for node, degree in G.out_degree() if degree == 0]

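# Enumerate all possible paths
for node in df['id']:
    for leaf in leaves:
        # Skip node == leaf: depending on the NetworkX version, all_simple_paths
        # may yield the single-node path [node] for that case.
        if node == leaf:
            continue
        for path in nx.all_simple_paths(G, node, leaf):
            data[node].append(path)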

Output:

>>> data
defaultdict(list,
            {'id1': [['id1', 'id3'],
              ['id1', 'id2', 'id5'],
              ['id1', 'id2', 'id4', 'id6', 'id7']],
             'id2': [['id2', 'id5'], ['id2', 'id4', 'id6', 'id7']],
             'id4': [['id4', 'id6', 'id7']],
             'id6': [['id6', 'id7']]})
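
Note that the dictionary above holds paths starting from every node, not only from the top-level post. If only complete threads are wanted (top-level post down to each leaf), a plain depth-first walk over a parent-to-children mapping should be enough and needs no extra library. Below is a minimal sketch under that assumption; `rows`, `children`, and `root_to_leaf_paths` are names made up for illustration, and the hard-coded pairs simply mirror the question's table. It also assumes the reply chains contain no cycles.

```python
from collections import defaultdict

# Hypothetical (id, parent_id) pairs mirroring the question's table;
# with pandas these could come from zip(df['id'], df['parent_id']).
rows = [
    ("id1", "_ post _"),
    ("id2", "id1"),
    ("id3", "id1"),
    ("id4", "id2"),
    ("id5", "id2"),
    ("id6", "id4"),
    ("id7", "id6"),
]


def root_to_leaf_paths(rows):
    """Return every path from a top-level post down to a leaf reply."""
    ids = {id_ for id_, _ in rows}

    # Map each parent to the list of its direct replies, in table order.
    children = defaultdict(list)
    for id_, parent in rows:
        children[parent].append(id_)

    # A root is any row whose parent is not itself a row (e.g. '_ post _').
    roots = [id_ for id_, parent in rows if parent not in ids]

    paths = []
    # Explicit stack instead of recursion, so thread depth is not limited
    # by Python's recursion limit.
    stack = [[root] for root in reversed(roots)]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:
            paths.append(path)  # reached a leaf: the path is a full thread
        # Push children in reverse so they are visited in table order.
        for kid in reversed(kids):
            stack.append(path + [kid])
    return paths


threads = root_to_leaf_paths(rows)
print(threads)
```

Each branch gets its own copy of the path (`path + [kid]`), so sibling replies never end up appended to the same list.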

If you want to merge the dictionary into your dataframe:

df['replies'] = df['id'].map(data)
print(df)

# Output:
    id       text parent_id                                            replies
0  id1  blah blah  _ post _  [[id1, id3], [id1, id2, id5], [id1, id2, id4, ...
1  id2  blah blah       id1                 [[id2, id5], [id2, id4, id6, id7]]
2  id3  blah blah       id1                                                 []
3  id4  blah blah       id2                                  [[id4, id6, id7]]
4  id5  blah blah       id2                                                 []
5  id6  blah blah       id4                                       [[id6, id7]]
6  id7  blah blah       id6                                                 []

Now you can explode your dataframe:

df = df.explode('replies')
print(df)

# Output:
    id       text parent_id                    replies
0  id1  blah blah  _ post _                 [id1, id3]
0  id1  blah blah  _ post _            [id1, id2, id5]
0  id1  blah blah  _ post _  [id1, id2, id4, id6, id7]
1  id2  blah blah       id1                 [id2, id5]
1  id2  blah blah       id1       [id2, id4, id6, id7]
2  id3  blah blah       id1                        NaN
3  id4  blah blah       id2            [id4, id6, id7]
4  id5  blah blah       id2                        NaN
5  id6  blah blah       id4                 [id6, id7]
6  id7  blah blah       id6                        NaN
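
The reverse direction can also be handy: given a single reply ID, follow the `parent_id` links upward to recover the thread that leads to it. This is only a sketch; `thread_for` and `parent_of` are hypothetical names, and the dictionary is hard-coded to mirror the question's table (with pandas it could presumably be built as `dict(zip(df['id'], df['parent_id']))`).

```python
# Hypothetical id -> parent_id mapping mirroring the question's table.
parent_of = {
    "id1": "_ post _",
    "id2": "id1",
    "id3": "id1",
    "id4": "id2",
    "id5": "id2",
    "id6": "id4",
    "id7": "id6",
}


def thread_for(reply_id, parent_of):
    """Follow parent links upward from reply_id; return the chain root-first."""
    chain = []
    seen = set()  # guards against accidental cycles in the data
    current = reply_id
    # Stop once the parent is not a known id (e.g. the '_ post _' marker).
    while current in parent_of and current not in seen:
        seen.add(current)
        chain.append(current)
        current = parent_of[current]
    return chain[::-1]


print(thread_for("id5", parent_of))
print(thread_for("id7", parent_of))
```

Since every reply has exactly one parent, this walk is linear in the depth of the thread and does not need to enumerate any other branches.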