How do I resolve one hot encoding if my test data has missing values in a col?

Question

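For example, if my training data has the categorical values (1,2,3,4,5) in a column, one-hot encoding gives me 5 columns. But my test data contains only 4 of those 5 values, say (1,3,4,5), so one-hot encoding the test data gives me only 4 columns. If I then apply my trained weights to the test data, I get an error because the column dimensions of the training and test data do not match: dim(4) != dim(5). Any suggestions on how to handle the missing column values? A screenshot of my code follows: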

image
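
Since the screenshot is not reproduced here, the mismatch described above can be sketched as follows. This is a minimal reproduction under assumed data, not the asker's actual code: the column name `x` and the two small frames are made up for illustration.

```python
import pandas as pd

# Hypothetical data mirroring the question:
# the training column has levels 1-5, the test column lacks level 2.
train = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
test = pd.DataFrame({"x": [1, 3, 4, 5]})

# Encoding each frame independently yields one column per level it contains.
train_encoded = pd.get_dummies(train, columns=["x"])
test_encoded = pd.get_dummies(test, columns=["x"])

print(train_encoded.shape[1])  # 5
print(test_encoded.shape[1])   # 4

# Columns that exist in the training encoding but not in the test encoding.
missing_in_test = sorted(set(train_encoded.columns) - set(test_encoded.columns))
print(missing_in_test)  # ['x_2']
```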

Answer 1

You can concatenate the two dataframes first, apply get_dummies, and then split them again, so that both end up with exactly the same columns, i.e.

import numpy as np
import pandas as pd

# Example dataframes: the training data has levels 1-5, the test data lacks level 2
Xtrain = pd.DataFrame({'x': np.array([4, 2, 3, 5, 3, 1])})
Xtest = pd.DataFrame({'x': np.array([4, 5, 1, 3])})

# Concatenate with keys so each frame can be recovered, then get dummies.
# dtype is set explicitly because recent pandas versions default to bool.
temp = pd.get_dummies(pd.concat([Xtrain, Xtest], keys=[0, 1]), columns=['x'], dtype=np.uint8)

# Select each frame back out of the MultiIndex and reassign
Xtrain, Xtest = temp.xs(0), temp.xs(1)

# Xtrain.to_numpy()  (as_matrix() was removed in pandas 1.0)
# array([[0, 0, 0, 1, 0],
#        [0, 1, 0, 0, 0],
#        [0, 0, 1, 0, 0],
#        [0, 0, 0, 0, 1],
#        [0, 0, 1, 0, 0],
#        [1, 0, 0, 0, 0]], dtype=uint8)

# Xtest.to_numpy()
# array([[0, 0, 0, 1, 0],
#        [0, 0, 0, 0, 1],
#        [1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0]], dtype=uint8)

Don't follow this approach. It is a simple trick with a lot of disadvantages. @Vast Academician's answer explains it better.

Answer 2

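Please, everyone, don't make this mistake!

Yes, you can fool yourself by concatenating train and test, but the real problem is in production. There, your model will one day face an unknown level of your categorical variable and then break.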
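
To make the production failure concrete, here is a small sketch under assumed data. The production batch and the level 6 are hypothetical, and the `reindex` alignment is shown only to illustrate what happens to an unseen level, not as something this answer recommends.

```python
import pandas as pd

# Training data as in the first answer; the model's input columns are frozen here.
train = pd.DataFrame({"x": [4, 2, 3, 5, 3, 1]})
train_columns = pd.get_dummies(train, columns=["x"]).columns

# Hypothetical production batch containing level 6, never seen during training.
production = pd.DataFrame({"x": [1, 6]})
prod_encoded = pd.get_dummies(production, columns=["x"])

# The new level produces a column the trained model has no weight for.
unseen = [c for c in prod_encoded.columns if c not in train_columns]
print(unseen)  # ['x_6']

# Forcing the training layout keeps the expected shape, but the row for
# level 6 becomes all zeros, so the information is silently lost.
aligned = prod_encoded.reindex(columns=train_columns, fill_value=0).astype(int)
print(aligned.shape)               # (2, 5)
print(int(aligned.iloc[1].sum()))  # 0
```

Concatenating train and test before encoding does not help here, because the production batch does not exist yet when the model is trained.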

In practice, some more viable options may be:

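Retrain your model periodically to account for new data.
Do not use one-hot. Seriously, there are many better options, such as leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few. Some classifiers, such as CatBoost, even have a built-in encoding mechanism, and there are mature Python libraries such as target_encoders where you will find lots of other options.
Embed the categorical feature values, which could save you from a complete retraining (http://flovv.github.io/Embeddings_with_keras/)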
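
As one illustration of the second option, the sketch below shows a plain mean target encoding written directly in pandas. It is a simplified version under assumed data: the frames, the target column `y`, and the helper `target_encode` are all hypothetical, and falling back to the global mean for unseen levels is one possible choice rather than a fixed rule.

```python
import pandas as pd

# Hypothetical training data: categorical feature "x" and numeric target "y".
train = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 3],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0, 50.0],
})
# Hypothetical test data: level 2 is absent and level 6 was never seen in training.
test = pd.DataFrame({"x": [1, 3, 6]})

# Statistics are learned from the training data only.
global_mean = train["y"].mean()
level_means = train.groupby("x")["y"].mean()

def target_encode(values, level_means, default):
    """Replace each level by its mean training target; unseen levels get `default`."""
    return values.map(level_means).fillna(default)

train["x_encoded"] = target_encode(train["x"], level_means, global_mean)
test["x_encoded"] = target_encode(test["x"], level_means, global_mean)

print(test)
```

The feature stays a single numeric column whatever levels appear, so the train and test dimensions always match. This unsmoothed version leaks the target into the training features; smoothing or out-of-fold estimates are commonly used to reduce that.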


numpy

machine-learning

pandas

one-hot-encoding