Is this one-hot encoding?

Question

Reading:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

It says "encode categorical integer features using a one-hot aka one-of-K scheme."

Does this also mean that it one-hot encodes a list of words?

From the Wikipedia definition of one-hot encoding (https://en.wikipedia.org/wiki/One-hot):
"In natural language processing, a one-hot vector is a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary. The vector consists of 0s in all cells with the exception of a single 1 in a cell used uniquely to identify the word."

Running the code below, it appears that LabelEncoder is not a correct implementation of one-hot encoding, whereas OneHotEncoder is:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# define example
data = ['w1 w2 w3', 'w1 w2']

values = array(data)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# binary encode
# Note: in scikit-learn 1.2+ this parameter is named sparse_output; sparse was removed in 1.4.
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

mlb = MultiLabelBinarizer()

print('fit_transform\n' , mlb.fit_transform(data))
print('\none hot\n' , onehot_encoder.fit_transform(integer_encoded))

This prints:

fit_transform
 [[1 1 1 1 1]
 [1 1 1 0 1]]

one hot
 [[0. 1.]
 [1. 0.]]

So LabelEncoder is not one-hot encoding. What type of encoding does LabelEncoder use?

From the output above, it seems that OneHotEncoder produces denser vectors than the LabelEncoder encoding scheme.

Update:

How do I decide whether to use LabelEncoder or OneHotEncoder to encode data for a machine learning algorithm?

Answer 1

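I don't think your question is clear enough...

First, LabelEncoder encodes labels with values between 0 and n_classes-1, whereas OneHotEncoder encodes categorical integer features using a one-hot (aka one-of-K) scheme. They are different.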
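To make the difference concrete, here is a minimal sketch that runs both encoders on the same small input. The `words` array is made up for illustration, and `.toarray()` is used only to turn the default sparse result into a dense array for inspection.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy input: four samples drawn from three classes.
words = np.array(["a", "b", "c", "a"])

# LabelEncoder: one integer per class, in the range 0 .. n_classes-1.
label_encoder = LabelEncoder()
integer_codes = label_encoder.fit_transform(words)

# OneHotEncoder: one column per category; it expects 2D input,
# so the 1D array is reshaped into a single feature column.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(words.reshape(-1, 1)).toarray()

print(integer_codes)  # one integer per sample
print(onehot)         # one row per sample, a single 1 in each row
```

LabelEncoder returns a 1D array of integers, while OneHotEncoder returns a 2D array with as many columns as there are distinct categories.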

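Second, yes, OneHotEncoder encodes a list of words. The Wikipedia definition says a one-hot vector is a 1 × N matrix. But what is N? It is simply the size of your vocabulary.

For example, suppose you have five words: a, b, c, d, e. One-hot encoding them gives: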

a -> [1, 0, 0, 0, 0]  # a one-hot 1 x 5 vector
b -> [0, 1, 0, 0, 0]  # a one-hot 1 x 5 vector
c -> [0, 0, 1, 0, 0]  # a one-hot 1 x 5 vector
d -> [0, 0, 0, 1, 0]  # a one-hot 1 x 5 vector
e -> [0, 0, 0, 0, 1]  # a one-hot 1 x 5 vector
# total five one-hot 1 x 5 vectors which can be expressed in a 5 x 5 matrix.
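The same mapping can be built by hand in a few lines of plain Python. This is only a sketch to show the idea; `one_hot_encode` is a hypothetical helper, not a scikit-learn function.

```python
def one_hot_encode(vocabulary):
    """Map each word to a list of length N = len(vocabulary) with a single 1."""
    size = len(vocabulary)
    return {
        word: [1 if position == index else 0 for position in range(size)]
        for index, word in enumerate(vocabulary)
    }


vocabulary = ["a", "b", "c", "d", "e"]
vectors = one_hot_encode(vocabulary)

for word in vocabulary:
    print(word, "->", vectors[word])
```

Each word's position in the vocabulary decides which cell holds the 1, so N grows with the vocabulary size.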

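Third, I'm actually not 100% sure what you are asking...

Finally, to answer your updated question: most of the time you should choose one-hot encoding or word embedding. The reason is that the vectors generated by LabelEncoder are too similar, which means there is not much difference between them. Because similar inputs are more likely to produce similar outputs, this makes it hard for your model to fit.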
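One way to read this point, offered here as an interpretation rather than something the answer spells out: integer codes put the classes on a single number line, so some pairs look closer than others for no real reason, while one-hot vectors keep every pair equally far apart. The sketch below uses made-up codes for the five words and plain Euclidean distance.

```python
import math

# Hypothetical LabelEncoder-style integer codes for five words.
labels = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}

# One-hot vectors built from the same codes.
one_hot = {
    word: [1.0 if i == code else 0.0 for i in range(len(labels))]
    for word, code in labels.items()
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# With integer codes, "a" is 1 away from "b" but 4 away from "e".
label_dist_ab = abs(labels["a"] - labels["b"])
label_dist_ae = abs(labels["a"] - labels["e"])

# With one-hot vectors, every pair of distinct words is equally far apart.
onehot_dist_ab = euclidean(one_hot["a"], one_hot["b"])
onehot_dist_ae = euclidean(one_hot["a"], one_hot["e"])

print(label_dist_ab, label_dist_ae)
print(onehot_dist_ab, onehot_dist_ae)
```

A model that treats its inputs as numbers will see the integer-coded "a" and "b" as near neighbours, even though the codes were assigned arbitrarily.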


Tags: python, machine-learning, one-hot-encoding