Fill matrix in pandas by grouping rows

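I pulled a table from a database and want to do some topic analysis on its entries. I created an empty matrix with one column per unique topic name. Rows are duplicated because each 'name' entry can be associated with more than one topic. In the end, I want a DataFrame that has a 1 in every topic column associated with the row's name. I will then drop the 'topic_label' column and, at some point, remove the duplicate rows. The real DataFrame is much larger; this is just a small example.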

Here is my data:

               topic_label                                              name  Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0           Misconceptions              When is menstrual bleeding too much?               0                        0                     0                    0                 0
1  Long-term health issues              When is menstrual bleeding too much?               0                        0                     0                    0                 0
2     Reproductive disease  10% of reproductive age women have endometriosis               0                        0                     0                    0                 0
3      Inadequate research  10% of reproductive age women have endometriosis               0                        0                     0                    0                 0
4         Unconscious bias                Male bias threatens women's health               0                        0                     0                    0                 0

I want it to look like this:

               topic_label                                              name  Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0           Misconceptions              When is menstrual bleeding too much?               1                        1                     0                    0                 0
1  Long-term health issues              When is menstrual bleeding too much?               1                        1                     0                    0                 0
2     Reproductive disease  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
3      Inadequate research  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
4         Unconscious bias                Male bias threatens women's health               0                        0                     0                    0                 1

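I tried using .loc in a loop, first slicing the data by name and then assigning the values (after setting name as the index), but this does not work when a name appears in only one row: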

name_set = list(set(df['name']))
df = df.set_index('name')

for i in name_set:
    df.loc[i, list(df.loc[i]['topic_label'])] = 1

I feel like I'm going in circles here... is there a better way?
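
A likely reason the loop breaks on a unique name is that .loc with a single label returns a Series when only one row matches, so the topic_label lookup yields a plain string and list() splits it into characters. The sketch below illustrates this on a trimmed copy of the sample data; the variable names (repeated, unique, chars, labels) are illustrative only, and passing the label inside a list is one possible workaround rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "topic_label": ["Misconceptions", "Long-term health issues", "Unconscious bias"],
    "name": ["When is menstrual bleeding too much?",
             "When is menstrual bleeding too much?",
             "Male bias threatens women's health"],
}).set_index("name")

# Two matching rows -> DataFrame; one matching row -> Series
repeated = df.loc["When is menstrual bleeding too much?"]
unique = df.loc["Male bias threatens women's health"]
print(type(repeated).__name__, type(unique).__name__)

# Indexing the Series gives a plain string, so list() splits it into characters
chars = list(unique["topic_label"])
print(chars[:3])

# Passing the label inside a list keeps the result list-like even for one match
labels = df.loc[["Male bias threatens women's health"], "topic_label"].tolist()
print(labels)
```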

One option is to use get_dummies to build a dummy (indicator) column for each topic_label, then call sum inside groupby.transform to aggregate the dummies for each name:

import pandas as pd

# unique() avoids duplicated columns when a topic appears under several names
cols = df['topic_label'].unique()
out = df.drop(columns=cols).join(pd.get_dummies(df['topic_label']).groupby(df['name']).transform('sum').reindex(cols, axis=1))
print(out)
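
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and splits the one-liner into steps. The intermediate names (topics, names, dummies, per_name) are illustrative, and the astype(int) call is an added assumption to keep the result as integers regardless of the dtype get_dummies returns.

```python
import pandas as pd

topics = ["Misconceptions", "Long-term health issues", "Reproductive disease",
          "Inadequate research", "Unconscious bias"]
names = (["When is menstrual bleeding too much?"] * 2
         + ["10% of reproductive age women have endometriosis"] * 2
         + ["Male bias threatens women's health"])

# Rebuild the question's frame: label, name, and one zero-filled column per topic
df = pd.DataFrame({"topic_label": topics, "name": names})
for t in topics:
    df[t] = 0

# One indicator column per label (columns come back sorted alphabetically)
dummies = pd.get_dummies(df["topic_label"])

# Sum the indicators within each name; transform keeps one row per original row
per_name = dummies.groupby(df["name"]).transform("sum")

# Restore the original column order and force an integer dtype
per_name = per_name.reindex(columns=topics).astype(int)

out = df[["topic_label", "name"]].join(per_name)
print(out)
```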

The join-based version returns a new DataFrame, out. If you would rather update df in place, you can use update:

# Overwrite the existing topic columns of df in place
df.update(pd.get_dummies(df['topic_label']).groupby(df['name']).transform('sum').reindex(df['topic_label'].unique(), axis=1))
print(df)

Output:

               topic_label                                              name  Misconceptions  Long-term health issues  Reproductive disease  Inadequate research  Unconscious bias
0           Misconceptions              When is menstrual bleeding too much?               1                        1                     0                    0                 0
1  Long-term health issues              When is menstrual bleeding too much?               1                        1                     0                    0                 0
2     Reproductive disease  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
3      Inadequate research  10% of reproductive age women have endometriosis               0                        0                     1                    1                 0
4         Unconscious bias                Male bias threatens women's health               0                        0                     0                    0                 1
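
Since the stated end goal is to drop topic_label and remove the duplicate rows, it may be simpler to build the one-row-per-name matrix directly. The sketch below uses pd.crosstab, which is a different route from the get_dummies approach above, on a rebuilt copy of the sample data; the name matrix is illustrative, and clip(upper=1) is an added assumption so that a (name, topic) pair listed more than once still shows as 1.

```python
import pandas as pd

df = pd.DataFrame({
    "topic_label": ["Misconceptions", "Long-term health issues", "Reproductive disease",
                    "Inadequate research", "Unconscious bias"],
    "name": ["When is menstrual bleeding too much?",
             "When is menstrual bleeding too much?",
             "10% of reproductive age women have endometriosis",
             "10% of reproductive age women have endometriosis",
             "Male bias threatens women's health"],
})

# Count each (name, topic) pair, then cap at 1 to get a 0/1 indicator matrix
matrix = pd.crosstab(df["name"], df["topic_label"]).clip(upper=1)
print(matrix)
```

Here name becomes the index and each topic becomes a column, so no separate drop_duplicates step is needed.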