Label Encoding n-dimensional categorical values

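I came across the post "Label encoding across multiple columns in scikit-learn", and one of the comments explains how each value of a given column is encoded in the range 0 to (n-1), where n is the number of distinct values in the column. This raises a question: when I encode red: 2, orange: 1 and green: 0, does that imply that green is closer to orange than red is, since 0 is closer to 1 than 2 is? That is not actually true. I initially thought that green got the value 0 because it occurs most often. However, that does not hold for the fruit column, where apple gets the value 0 even though orange occurs the maximum number of times.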

I would like to summarize Label Encoder and One Hot Encoding:

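Indeed, Label Encoder simply gives an integer representation of each cell value. This means that if we label encode the categorical values in the dataset above, the result would imply that green is closer to orange than red is, since 0 is closer to 1 than 2 is, which is wrong.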
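As a minimal sketch of where these integers come from, assuming scikit-learn's LabelEncoder and made-up sample data (the lists below are hypothetical, not the dataset from the question): the encoder stores the distinct labels in sorted order, and each label's position in that order becomes its code. Frequency plays no part, which is why apple gets 0 even when orange is more common.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data, chosen only for illustration.
colors = ["green", "red", "orange", "green", "red"]
fruits = ["orange", "apple", "orange", "orange"]

color_encoder = LabelEncoder()
color_codes = color_encoder.fit_transform(colors)

# classes_ holds the distinct labels in sorted order;
# the index of a label in classes_ is the integer it is encoded as.
color_mapping = {label: code for code, label in enumerate(color_encoder.classes_)}
print(color_mapping)         # {'green': 0, 'orange': 1, 'red': 2}
print(color_codes.tolist())  # [0, 2, 1, 0, 2]

fruit_encoder = LabelEncoder()
fruit_codes = fruit_encoder.fit_transform(fruits)

# 'apple' sorts before 'orange', so it gets 0 even though 'orange' is more frequent.
print(fruit_encoder.classes_.tolist())  # ['apple', 'orange']
print(fruit_codes.tolist())             # [1, 0, 1, 1]
```

The codes are just positions in a sorted list, so the numeric distance between them carries no meaning for nominal data.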

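One Hot Encoding, on the other hand, creates a separate column for each categorical value and fills it with 0 or 1, indicating the absence or presence of that value respectively. The pandas function pd.get_dummies(dataframe) produces equivalent output.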
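A short sketch of the one-hot case with pd.get_dummies, again on a hypothetical single-column DataFrame. The column name "color" and the dtype=int argument (used here to get 0/1 integers rather than booleans) are choices made for this illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"color": ["green", "red", "orange", "green"]})

# One indicator column is created per distinct value of "color".
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
print(one_hot)

# Each row has exactly one 1, in the column matching its original value.
print(one_hot.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

Because every category gets its own column, no category ends up numerically "closer" to another.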

So, if the given dataset contains categorical values that are ordinal in nature, it is sensible to use Label Encoding; but if the data is nominal, one should go with One Hot Encoding.
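One caveat for the ordinal case: since LabelEncoder assigns codes in sorted order, the integers only match the intended ranking if the labels happen to sort that way. A possible workaround, sketched here with pandas and a hypothetical low/medium/high column, is to state the order explicitly:

```python
import pandas as pd

# Hypothetical ordinal data and its intended ranking, lowest first.
sizes = ["medium", "low", "high", "low"]
order = ["low", "medium", "high"]

# An ordered Categorical assigns codes by position in `order`,
# not by alphabetical order of the labels.
size_codes = pd.Categorical(sizes, categories=order, ordered=True).codes
print(size_codes.tolist())  # [1, 0, 2, 0]
```

Here low < medium < high maps to 0 < 1 < 2, so the numeric order reflects the real one.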

https://discuss.analyticsvidhya.com/t/dummy-variables-is-necessary-to-standardize-them/66867/2

python

encoding

encode

categorical-data