Fill matrix in pandas by grouping rows
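I pulled a table from a database and want to do some topic analysis on some of its entries. I created an empty matrix whose columns are the unique topic names, and I have duplicate rows because each 'name' entry can be associated with several topics. Ultimately, I want a dataframe that has a 1 in every topic column associated with a row's name. I will then drop the 'topic_label' column and, at some point, drop the duplicate rows. The actual dataframe is much larger; this is just an example.
This is my data: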
   topic_label              name                                              Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0  Misconceptions           When is menstrual bleeding too much?                           0                        0                     0                    0                 0
1  Long-term health issues  When is menstrual bleeding too much?                           0                        0                     0                    0                 0
2  Reproductive disease     10% of reproductive age women have endometriosis               0                        0                     0                    0                 0
3  Inadequate research      10% of reproductive age women have endometriosis               0                        0                     0                    0                 0
4  Unconscious bias         Male bias threatens women's health                             0                        0                     0                    0                 0
I would like it to look like this:
   topic_label              name                                              Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0  Misconceptions           When is menstrual bleeding too much?                           1                        1                     0                    0                 0
1  Long-term health issues  When is menstrual bleeding too much?                           1                        1                     0                    0                 0
2  Reproductive disease     10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
3  Inadequate research      10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
4  Unconscious bias         Male bias threatens women's health                             0                        0                     0                    0                 1
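I tried using .loc in a loop, first slicing the data by name and then assigning the values (after setting name as the index), but this doesn't work when a row is unique: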
name_set = list(set(df['name']))
df = df.set_index('name')
for i in name_set:
    df.loc[i, list(df.loc[i]['topic_label'])] = 1
I feel like I'm going in circles here... is there a better way?
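A likely reason the loop breaks for unique names, sketched on a trimmed copy of the sample data (the variable names `repeated`, `unique`, and `unique_fixed` are illustrative only): when a name occurs once, `df.loc[i]` returns a single row as a Series, so `['topic_label']` is a plain string and `list(...)` splits it into characters instead of producing a one-element list of column names.

```python
import pandas as pd

# Trimmed copy of the sample data: one repeated name and one unique name
df = pd.DataFrame({
    "topic_label": ["Misconceptions", "Long-term health issues", "Unconscious bias"],
    "name": ["When is menstrual bleeding too much?",
             "When is menstrual bleeding too much?",
             "Male bias threatens women's health"],
}).set_index("name")

# Repeated label: .loc returns a DataFrame, so the column lookup is a Series
repeated = df.loc["When is menstrual bleeding too much?"]["topic_label"]
# Unique label: .loc returns one row as a Series, so the lookup is a scalar string
unique = df.loc["Male bias threatens women's health"]["topic_label"]

print(type(repeated).__name__)  # Series
print(type(unique).__name__)    # str
print(list(unique)[:3])         # ['U', 'n', 'c'], characters rather than a column name

# Passing the label inside a list keeps the result list-like even for a single row
unique_fixed = df.loc[["Male bias threatens women's health"], "topic_label"]
print(list(unique_fixed))       # ['Unconscious bias']
```

Wrapping the label in a list would patch the loop, but it still assigns one name at a time, which is slow on a large frame.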
One option is to use get_dummies to build dummy variables for each topic_label, then call sum in groupby.transform to aggregate the dummy variables by name:
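cols = df['topic_label'].tolist()
# Build one indicator column per topic, then sum within each name and broadcast back to every row
dummies = pd.get_dummies(df['topic_label']).groupby(df['name']).transform('sum')
# Replace the empty topic columns with the filled ones, keeping the original column order
out = df.drop(columns=cols).join(dummies.reindex(df['topic_label'], axis=1))
print(out)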
The above returns a new DataFrame, out. If you want to update df instead, you can use update:
df.update(pd.get_dummies(df['topic_label']).groupby(df['name']).transform('sum').reindex(df['topic_label'], axis=1))
print(df)
Output:
   topic_label              name                                              Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0  Misconceptions           When is menstrual bleeding too much?                           1                        1                     0                    0                 0
1  Long-term health issues  When is menstrual bleeding too much?                           1                        1                     0                    0                 0
2  Reproductive disease     10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
3  Inadequate research      10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
4  Unconscious bias         Male bias threatens women's health                             0                        0                     0                    0                 1
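To reproduce this end to end, here is a self-contained sketch that rebuilds the sample frame and walks through the same get_dummies / groupby.transform idea one step at a time. The intermediate names (`dummies`, `filled`, `result`) are illustrative, `dtype=int` is passed so the indicators are integers regardless of pandas version, and the last step (dropping topic_label and the duplicate rows) is the clean-up described in the question rather than part of the answer above.

```python
import pandas as pd

topics = ["Misconceptions", "Long-term health issues", "Reproductive disease",
          "Inadequate research", "Unconscious bias"]
names = ["When is menstrual bleeding too much?",
         "When is menstrual bleeding too much?",
         "10% of reproductive age women have endometriosis",
         "10% of reproductive age women have endometriosis",
         "Male bias threatens women's health"]

# Rebuild the sample data: one row per (topic, name) pair plus empty topic columns
df = pd.DataFrame({"topic_label": topics, "name": names})
for topic in topics:
    df[topic] = 0

# Step 1: one indicator column per topic, with a 1 on the row carrying that topic
dummies = pd.get_dummies(df["topic_label"], dtype=int)

# Step 2: sum the indicators within each name and broadcast the totals
# back onto every row of that name (transform keeps the original row index)
filled = dummies.groupby(df["name"]).transform("sum")

# Step 3: restore the original topic column order and attach to the key columns
out = df[["topic_label", "name"]].join(filled[topics])
print(out)

# Clean-up from the question: drop topic_label, then collapse the duplicate rows
result = out.drop(columns="topic_label").drop_duplicates().reset_index(drop=True)
print(result)
```

With the sample data, `result` has one row per name. If only that final matrix is needed, `pd.crosstab(df["name"], df["topic_label"])` is another way to get there directly, though it sorts both names and topics.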