How can I turn a pandas DataFrame into an ordered list with a many-to-one relationship?

I currently have a pandas DataFrame with many answers for each question, and I am trying to turn it into a list so that I can compute cosine similarity.

At the moment, questions are joined to their answers in the DataFrame through parent_id = q_id, as shown:

print (df)
   q_id      q_body  parent_id    a_body
0     1  question 1          1  answer 1
1     1  question 1          1  answer 2
2     1  question 1          1  answer 3
3     2  question 2          2  answer 1
4     2  question 2          2  answer 2

The output I am looking for is:

("question 1", "answer 1", "answer 2", "answer 3")

("question 2", "answer 1", "answer 2")

Any help would be greatly appreciated. Thank you very much!

I think you need groupby with apply:

#output is a tuple that starts with the question value
#(assigned to out so that df stays available for the next examples)
out = df.groupby('q_body')['a_body'].apply(lambda x: tuple([x.name] + list(x)))
print (out)
q_body
question 1    (question 1, answer 1, answer 2, answer 3)
question 2              (question 2, answer 1, answer 2)
Name: a_body, dtype: object

#output is a list that starts with the question value
out = df.groupby('q_body')['a_body'].apply(lambda x: [x.name] + list(x))
print (out)
q_body
question 1    [question 1, answer 1, answer 2, answer 3]
question 2              [question 2, answer 1, answer 2]
Name: a_body, dtype: object

#output is a list without the question value
out = df.groupby('q_body')['a_body'].apply(list)
print (out)
q_body
question 1    [answer 1, answer 2, answer 3]
question 2              [answer 1, answer 2]
Name: a_body, dtype: object

#grouping by parent_id, without the question value
out = df.groupby('parent_id')['a_body'].apply(list)
print (out)
parent_id
1    [answer 1, answer 2, answer 3]
2              [answer 1, answer 2]
Name: a_body, dtype: object

#output is a string, values are concatenated with ', '
out = df.groupby('parent_id')['a_body'].apply(', '.join)
print (out)
parent_id
1    answer 1, answer 2, answer 3
2              answer 1, answer 2
Name: a_body, dtype: object

But if you need the output as a list, add tolist:

L = df.groupby('q_body')['a_body'].apply(lambda x: tuple([x.name] + list(x))).tolist()
print (L)
[('question 1', 'answer 1', 'answer 2', 'answer 3'), ('question 2', 'answer 1', 'answer 2')]
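
Since the question asks for an ordered list, here is a minimal self-contained sketch of the same idea. It rebuilds the sample data from the question and assumes that "ordered" means the questions should keep the order in which they first appear. By default groupby sorts the group keys, so sort=False is passed to keep that order. The name rows is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "q_id": [1, 1, 1, 2, 2],
    "q_body": ["question 1"] * 3 + ["question 2"] * 2,
    "parent_id": [1, 1, 1, 2, 2],
    "a_body": ["answer 1", "answer 2", "answer 3", "answer 1", "answer 2"],
})

# sort=False keeps the groups in order of first appearance
# instead of sorting them by key.
rows = [
    (question, *answers)
    for question, answers in df.groupby("q_body", sort=False)["a_body"]
]
print(rows)
```

Iterating over the grouped column yields (key, Series) pairs, so each tuple starts with the question text followed by its answers in their original row order.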

import pandas as pd

df = pd.DataFrame([
        ['question 1', 'answer 1'],
        ['question 1', 'answer 2'],
        ['question 1', 'answer 3'],
        ['question 2', 'answer 1'],
        ['question 2', 'answer 2'],
    ], columns=['q_body', 'a_body'])

print(df)

       q_body    a_body
0  question 1  answer 1
1  question 1  answer 2
2  question 1  answer 3
3  question 2  answer 1
4  question 2  answer 2

Use apply(list):

df.groupby('q_body').a_body.apply(list)

q_body
question 1    [answer 1, answer 2, answer 3]
question 2              [answer 1, answer 2]

See if this helps:

result = df.groupby('q_id').agg({'q_body': lambda x: x.iloc[0], 'a_body': lambda x: ', '.join(x)})
result['output'] = result.q_body + ', ' + result.a_body

This creates a new column, output, that holds the question followed by its answers as a single comma-separated string.
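
If tuples are wanted rather than one joined string, a possible variant of the same q_id grouping is sketched below. It assumes the sample data from the question and that q_id identifies a question uniquely, so two questions with identical text would still stay separate. The names grouped, questions, answers and output are illustrative only.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "q_id": [1, 1, 1, 2, 2],
    "q_body": ["question 1"] * 3 + ["question 2"] * 2,
    "parent_id": [1, 1, 1, 2, 2],
    "a_body": ["answer 1", "answer 2", "answer 3", "answer 1", "answer 2"],
})

# Group on the id rather than on the question text.
grouped = df.groupby("q_id", sort=False)
questions = grouped["q_body"].first()    # one question text per q_id
answers = grouped["a_body"].apply(list)  # list of answers per q_id

# Build one tuple per question: (question, answer, answer, ...).
output = [(questions.loc[k], *answers.loc[k]) for k in questions.index]
print(output)
```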