How to handle a multi-label categorical feature for a binary classification problem?

Question

I have a dataset like this:

   profile     category  target
0        1      [5, 10]       1
1        2          [1]       0
2        3   [23, 5000]       1
3        4  [700, 4500]       0

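How should I handle the category feature? The table may also contain other features. One-hot encoding consumes too much space, because there are about 10 million rows. Any suggestions would be helpful.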

Answer 1

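My idea is to split this array into separate columns, one per position in the list.

This results in the following data frame: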

   profile     0    1  target
0        1     5    10       1
1        2     1             0
2        3     23   5000       1
3        4     700  4500       0
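The answer does not show the code for the split, so the following is only a minimal sketch of one way to do it with pandas, assuming the `category` column holds Python lists. The variable names `split` and `wide` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# Expand each list into positional columns (0, 1, ...).
# Rows with shorter lists are padded with NaN.
split = pd.DataFrame(df["category"].tolist(), index=df.index)

# Put the positional columns between profile and target.
wide = pd.concat([df[["profile"]], split, df[["target"]]], axis=1)
print(wide)
```

The number of positional columns equals the length of the longest list, so this step stays narrow even when the number of distinct categories is large.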

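As a next step you can pivot it so that each category becomes its own feature (filled with 1 if the profile has that category), which results in the following data frame: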

   profile     1  ...  5  ... 10 ... 23 target
0        1     0       1       1      0      1
1        2     1       0       0      0      0
2        3     0       0       0      1      1
3        4     0       0       0      0      0
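One possible way to build this indicator table, sketched with pandas on the question's four rows. It uses `explode` and `crosstab` directly on the list column, which is an assumption about the implementation rather than the answerer's own code. The result is a dense frame, so treat it as an illustration for small data only.

```python
import pandas as pd

df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# One row per (profile, category) pair.
long = df[["profile", "category"]].explode("category").astype({"category": int})

# Count pairs per profile, then clip so repeated categories still give 0/1.
indicators = pd.crosstab(long["profile"], long["category"]).clip(upper=1)

# Attach the target, aligned on profile.
result = indicators.join(df.set_index("profile")["target"])
print(result)
```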

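You will then have each category as a feature, which can help (this is similar to a text classification problem). After that you can apply a dimensionality reduction technique such as PCA.

With this approach you preserve the categorical nature of the data, and you can still reduce the dimensionality later with some mathematical techniques.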
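A rough sketch of the dimensionality reduction step. It swaps in scikit-learn's `TruncatedSVD` for the PCA the answer mentions, because `TruncatedSVD` works directly on sparse matrices and does not center the data. The matrix below is random placeholder data, not the question's dataset, and the choice of 20 components is arbitrary.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Placeholder indicator matrix: 1000 profiles x 500 categories, about 1% non-zero.
X = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0  # turn the random values into 0/1 indicators

# Project the sparse indicators onto 20 dense components.
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (1000, 20)
```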

Answer 2

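MultiLabelBinarizer is the solution for this kind of problem. With `sparse_output=True` it gives sparse output, which is memory efficient. You can convert the other features into a sparse matrix as well, then combine all the features and feed them into the machine learning model.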
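A minimal sketch of this approach with scikit-learn and SciPy, using the four rows from the question. The extra numeric feature `other` and the choice of `LogisticRegression` are assumptions made for illustration; any estimator that accepts sparse input could be used instead.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

categories = [[5, 10], [1], [23, 5000], [700, 4500]]
target = np.array([1, 0, 1, 0])
# Hypothetical additional numeric feature, one value per profile.
other = np.array([[0.5], [1.2], [0.1], [3.4]])

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
X_cat = mlb.fit_transform(categories)

# Stack the sparse indicators with the other features, keeping everything sparse.
X = sparse.hstack([X_cat, sparse.csr_matrix(other)], format="csr")

model = LogisticRegression().fit(X, target)
print(mlb.classes_, X.shape)
```

Only the non-zero entries are stored, so memory grows with the total number of (profile, category) pairs rather than with rows times distinct categories.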


machine-learning

feature-extraction

data-science

feature-engineering