How can I turn a pandas DataFrame into an ordered list with a many-to-one relationship?
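I currently have a pandas DataFrame with many answers to a single question, and I am trying to turn it into a list so that I can compute cosine similarity.
The DataFrame links each question to its answers through parent_id = q_id, so there are many answer rows per question, as shown: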
print(df)
   q_id      q_body  parent_id    a_body
0     1  question 1          1  answer 1
1     1  question 1          1  answer 2
2     1  question 1          1  answer 3
3     2  question 2          2  answer 1
4     2  question 2          2  answer 2
The output I am looking for is:
("question 1", "answer 1", "answer 2", "answer 3")
("question 2", "answer 1", "answer 2")
Any help would be greatly appreciated! Thank you very much.
I think you need groupby and apply:
# each result is assigned to s (not df) so that df stays intact for the next snippet

# output is a tuple that includes the question value
s = df.groupby('q_body')['a_body'].apply(lambda x: tuple([x.name] + list(x)))
print(s)
q_body
question 1    (question 1, answer 1, answer 2, answer 3)
question 2              (question 2, answer 1, answer 2)
Name: a_body, dtype: object

# output is a list that includes the question value
s = df.groupby('q_body')['a_body'].apply(lambda x: [x.name] + list(x))
print(s)
q_body
question 1    [question 1, answer 1, answer 2, answer 3]
question 2              [question 2, answer 1, answer 2]
Name: a_body, dtype: object

# output is a list without the question value
s = df.groupby('q_body')['a_body'].apply(list)
print(s)
q_body
question 1    [answer 1, answer 2, answer 3]
question 2              [answer 1, answer 2]
Name: a_body, dtype: object

# grouping by parent_id, without the question value
s = df.groupby('parent_id')['a_body'].apply(list)
print(s)
parent_id
1    [answer 1, answer 2, answer 3]
2              [answer 1, answer 2]
Name: a_body, dtype: object

# output is a string, values are concatenated with ", "
s = df.groupby('parent_id')['a_body'].apply(', '.join)
print(s)
parent_id
1    answer 1, answer 2, answer 3
2              answer 1, answer 2
Name: a_body, dtype: object
But if you need the output as a plain Python list, add tolist:
L = df.groupby('q_body')['a_body'].apply(lambda x: tuple([x.name] + list(x))).tolist()
print (L)
[('question 1', 'answer 1', 'answer 2', 'answer 3'), ('question 2', 'answer 1', 'answer 2')]
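Since the question asks for an ordered list, it may be worth noting that groupby sorts the group keys by default. The following is a minimal sketch, not part of the original answer, that passes sort=False so that questions come out in order of first appearance. The data is hypothetical: it uses the same columns as the question, with the "question 2" rows moved to the top to make the ordering visible.

```python
import pandas as pd

# Hypothetical data: same columns as in the question, but with the
# "question 2" rows placed first so the effect of sort=False is visible.
df = pd.DataFrame({
    "q_id": [2, 2, 1, 1, 1],
    "q_body": ["question 2", "question 2",
               "question 1", "question 1", "question 1"],
    "parent_id": [2, 2, 1, 1, 1],
    "a_body": ["answer 1", "answer 2", "answer 1", "answer 2", "answer 3"],
})

# sort=False keeps groups in order of first appearance instead of
# sorting the group keys; x.name is the group key (the question text).
rows = (
    df.groupby("q_body", sort=False)["a_body"]
      .apply(lambda x: tuple([x.name] + list(x)))
      .tolist()
)
print(rows)
```

With the default sort=True, "question 1" would come first regardless of where its rows sit in the frame. Within each group, the answers keep their original row order either way.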
import pandas as pd

df = pd.DataFrame([
    ['question 1', 'answer 1'],
    ['question 1', 'answer 2'],
    ['question 1', 'answer 3'],
    ['question 2', 'answer 1'],
    ['question 2', 'answer 2'],
], columns=['q_body', 'a_body'])
print(df)
       q_body    a_body
0  question 1  answer 1
1  question 1  answer 2
2  question 1  answer 3
3  question 2  answer 1
4  question 2  answer 2
Use apply(list):
df.groupby('q_body').a_body.apply(list)
q_body
question 1    [answer 1, answer 2, answer 3]
question 2              [answer 1, answer 2]
See if this helps:
result = df.groupby('q_id').agg({'q_body': lambda x: x.iloc[0], 'a_body': lambda x: ', '.join(x)})
result['output'] = result.q_body + ', ' + result.a_body
This creates a new column, output, which holds each question followed by its answers as a single comma-separated string.
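As a self-contained sketch of this agg-based approach, assuming the four-column frame from the question, the snippet below rebuilds the data and runs the aggregation end to end. The string alias "first" is swapped in for the lambda x: x.iloc[0] used above, and the variable name outputs is an illustrative choice, not something from the original answer.

```python
import pandas as pd

# Data reconstructed from the frame printed in the question.
df = pd.DataFrame({
    "q_id": [1, 1, 1, 2, 2],
    "q_body": ["question 1"] * 3 + ["question 2"] * 2,
    "parent_id": [1, 1, 1, 2, 2],
    "a_body": ["answer 1", "answer 2", "answer 3", "answer 1", "answer 2"],
})

# One row per question: keep the first q_body of each group and
# join all of its answers into one comma-separated string.
result = df.groupby("q_id").agg({"q_body": "first", "a_body": ", ".join})
result["output"] = result["q_body"] + ", " + result["a_body"]

# Plain Python list of the combined strings, one per question.
outputs = result["output"].tolist()
print(outputs)
```

Note that this yields one string per question, not a tuple, so the boundaries between the question and the individual answers are lost if any text itself contains ", ". When separate items are needed, the groupby and apply variant that builds tuples is the safer fit.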