Encoding categorical columns - Label encoding vs one hot encoding for Decision trees
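Given the way decision trees and random forests use split logic, I was under the impression that label encoding would not be a problem for these models, since we are going to split the column anyway. For example, if we have gender as 'male', 'female' and 'other', label encoding turns it into 0, 1, 2, which is interpreted as 0 < 1 < 2. But since we are splitting the column, I thought this would not matter, because splitting on 'male' or on '0' is the same thing. However, when I tried both label encoding and one hot encoding on my dataset, one hot encoding gave better accuracy and precision.
Could you please share your thoughts?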
The ACCURACY SCORE of various models on train and test are:
The accuracy score of simple decision tree on label encoded data: TRAIN: 86.46% TEST: 79.42%
The accuracy score of tuned decision tree on label encoded data: TRAIN: 81.74% TEST: 81.33%
The accuracy score of random forest ensembler on label encoded data: TRAIN: 82.26% TEST: 81.63%
The accuracy score of simple decision tree on one hot encoded data: TRAIN: 86.46% TEST: 79.74%
The accuracy score of tuned decision tree on one hot encoded data: TRAIN: 82.04% TEST: 81.46%
The accuracy score of random forest ensembler on one hot encoded data: TRAIN: 82.41% TEST: 81.66%
The PRECISION SCORE of various models on train and test are:
The precision score of simple decision tree on label encoded data: TRAIN: 78.26% TEST: 57.92%
The precision score of tuned decision tree on label encoded data: TRAIN: 66.54% TEST: 64.6%
The precision score of random forest ensembler on label encoded data: TRAIN: 70.1% TEST: 67.44%
The precision score of simple decision tree on one hot encoded data: TRAIN: 78.26% TEST: 58.84%
The precision score of tuned decision tree on one hot encoded data: TRAIN: 68.06% TEST: 65.81%
The precision score of random forest ensembler on one hot encoded data: TRAIN: 70.34% TEST: 67.32%
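You can see this as a regularization effect: with one hot encoding your model is simpler, and therefore generalizes better. So you get better performance.
Take your gender feature as an example: [male, female, other] becomes [0, 1, 2] with label encoding.
Now suppose there is a specific configuration of the other features that holds only for females: the tree needs two branches to select females, one that selects sex greater than zero and another that selects sex lower than 2.
With one hot encoding, by contrast, you only need one branch to make the selection, say sex_female greater than zero.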
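The point about one branch versus two can be checked with a small sketch. It is only an illustration on made-up toy data, not your dataset or any library's tree implementation: the helper `single_split_isolates` and the sample values are hypothetical, and it simply asks whether a single threshold split on a column can separate 'female' from everything else.

```python
# Toy illustration on hypothetical data: can ONE threshold split isolate "female"?
label_code = {"male": 0, "female": 1, "other": 2}  # label encoding from the example


def single_split_isolates(column, target_mask):
    """Return True if some threshold t makes `value > t` match target_mask
    exactly (or its complement), i.e. a single tree node is enough."""
    values = sorted(set(column))
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in thresholds:
        right = [v > t for v in column]
        if right == target_mask or [not r for r in right] == target_mask:
            return True
    return False


sexes = ["male", "female", "other", "female", "male", "other"]
is_female = [s == "female" for s in sexes]

# Label encoded column: female sits between male (0) and other (2).
label_column = [label_code[s] for s in sexes]
# One hot encoded column for the "female" category only.
onehot_female = [1 if s == "female" else 0 for s in sexes]

label_ok = single_split_isolates(label_column, is_female)
onehot_ok = single_split_isolates(onehot_female, is_female)
print(label_ok, onehot_ok)
```

On this toy data the label encoded column cannot isolate 'female' with one threshold, because 'female' is the middle value and needs both sex > 0 and sex < 2, while the sex_female column does it with one split.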