One-hot encode a column of lists, including an additional confidence decimal number

Question

I have a table that I want to one-hot encode. I can do this with pandas get_dummies or sklearn MultiLabelBinarizer, for example. The usual approach looks like this:

    categories              a   b   c   e
0   [a, b, c]           0   1   1   1   0
1   [c]          --->   1   0   0   1   0
2   [b, c, e]           2   0   1   1   1
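
For reference, the plain encoding shown above can be reproduced in pandas alone. This is a minimal sketch, assuming the column holds Python lists of strings; the frame `df` and its sample data are reconstructed from the table, and `explode` plus `get_dummies` is only one of several ways to do it.

```python
import pandas as pd

# Sample data reconstructed from the table above (an assumption for illustration)
df = pd.DataFrame({"categories": [["a", "b", "c"], ["c"], ["b", "c", "e"]]})

# One row per list element, keeping the original row label as the index
exploded = df["categories"].explode()

# One indicator column per category, then collapse back to one row per original row
encoded = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(encoded)
```

The `groupby(level=0).max()` step merges the per-element indicator rows that share an original row label.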

However, in my case I also have a confidence value for each category, like this:

    categories
0   [{a:0.3}, {b:0.4}, {c:0.5}]
1   [{c:0.8}]
2   [{b:1}, {c:1}, {e:0.1}]

I want to incorporate this knowledge into my decision tree classifier, i.e. I want to get my data into this format:

    a   b   c   e
0   0.3 0.4 0.5 0
1   0   0   0.8 0
2   0   1.0 1.0 0.1

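I could build the plain one-hot table first and then change the values by iterating over all rows. However, I am hoping there is a simpler way.

How can I one-hot encode the table above and incorporate the additional category-confidence information?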

Answer 1

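Use a dict comprehension to flatten each row's list of dictionaries into a single dictionary:

import pandas as pd

# Merge each row's single-key dicts into one dict; categories missing from a row become 0
df = (pd.DataFrame([{k: v for d in x for k, v in d.items()} for x in df['categories']])
        .fillna(0))
print(df)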
     a    b    c    e
0  0.3  0.4  0.5  0.0
1  0.0  0.0  0.8  0.0
2  0.0  1.0  1.0  0.1
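
To try this end to end, here is a self-contained sketch. It assumes each cell is a list of single-key dicts mapping a category to its confidence, as in the question. The helper name `encode_confidences` is hypothetical, and the sample data is copied from the question.

```python
import pandas as pd

def encode_confidences(cells):
    """Turn rows of [{category: confidence}, ...] into a wide table, 0 where absent."""
    # Merge the single-key dicts of each row into one dict per row
    rows = [{k: v for d in cell for k, v in d.items()} for cell in cells]
    # Building the frame leaves NaN for categories a row lacks; replace those with 0
    return pd.DataFrame(rows).fillna(0)

df = pd.DataFrame({
    "categories": [
        [{"a": 0.3}, {"b": 0.4}, {"c": 0.5}],
        [{"c": 0.8}],
        [{"b": 1}, {"c": 1}, {"e": 0.1}],
    ]
})

encoded = encode_confidences(df["categories"])
print(encoded)
```

If the original frame has a non-default index, passing `index=df.index` to `pd.DataFrame` should keep the encoded rows aligned with it. If a category appears more than once in the same row, the last confidence wins.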

python

decision-tree

pandas

scikit-learn

one-hot-encoding