LabelEncoding() vs OneHotEncoding() (sklearn, pandas) suggestions

Question

I have 3 types of categorical data in my dataframe, df.

df['Vehicles Owned'] = [1, 2, '3+', 2, 1, 2, '3+', 2]  # '3+' quoted so the list is valid Python
df['Sex'] = ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm']
df['Income'] = [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535]

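What should I do for df['Vehicles Owned']? (One-hot encode it, label encode it, or leave it as is after converting 3+ to an integer? I have used the integer values as they are. Looking for suggestions, since the values have an order.)

For df['Sex'], should I label encode it or one-hot encode it? (Since there is no order, I have used one-hot encoding.)

df['Income'] has a lot of variation. So should I convert it to bins and use one-hot encoding to represent low, medium, and high incomes?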

Answer 1

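I would recommend:

For sex, one-hot encode, which translates to using a single boolean variable, is_female or is_male. For n categories you need n-1 one-hot-encoded variables, because the nth is linearly dependent on the first n-1.
For vehicles_owned, if you want to preserve the order, I would remap your variable from [1,2,3,3+] to [1,2,3,4] and treat it as an int variable, or to [1,2,3,3.5] and treat it as a float variable.
For income: you should probably just leave it as a float variable. Some models (like GBT models) will likely do some sort of binning under the hood. If your income data happens to follow an exponential distribution, you might try taking its log. But simply converting it to bins in your own feature engineering is not something I would recommend.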
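The three recommendations above could be sketched roughly as follows. This is only an illustration: the feature names (is_female, vehicles_owned, log_income) are made up here, the ordinal mapping covers just the values that appear in the question's data, and np.log1p is one possible choice of log transform.

```python
import numpy as np
import pandas as pd

# Data from the question; '3+' is stored as a string.
df = pd.DataFrame({
    "Vehicles Owned": [1, 2, "3+", 2, 1, 2, "3+", 2],
    "Sex": ["m", "m", "f", "m", "f", "f", "m", "m"],
    "Income": [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535],
})

features = pd.DataFrame(index=df.index)

# Sex: two categories, so a single 0/1 column is enough (n-1 columns for n categories).
features["is_female"] = (df["Sex"] == "f").astype(int)

# Vehicles Owned: keep the order by remapping to integers.
# Assumption: only 1, 2 and '3+' occur, so '3+' becomes 3 here.
vehicle_order = {1: 1, 2: 2, "3+": 3}
features["vehicles_owned"] = df["Vehicles Owned"].map(vehicle_order)

# Income: keep it numeric; a log transform is optional and only worth
# trying if the distribution is heavily right-skewed.
features["log_income"] = np.log1p(df["Income"])

print(features)
```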

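The meta-advice for all of these things is to set up a cross-validation scheme that you are confident in, try different formulations for each of your feature engineering decisions, and then follow your cross-validated performance measure to make your final decision.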
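One way to act on that advice is to score each candidate feature set with the same cross-validation splits and compare the results. The sketch below is hypothetical throughout: the question gives no target variable, so both the data and the target are synthetic (and the target is deliberately built from log income, so the log formulation is expected to win here). The model, metric, and candidate names are also just placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; none of this comes from the question.
rng = np.random.default_rng(0)
n = 200
income = rng.exponential(scale=40000, size=n)
is_female = rng.integers(0, 2, size=n)
# Hypothetical target that depends on log income plus a little noise.
y = 2.0 * np.log1p(income) + 0.5 * is_female + rng.normal(0, 0.1, size=n)

# Two candidate formulations of the same features.
candidates = {
    "raw_income": pd.DataFrame({"is_female": is_female, "income": income}),
    "log_income": pd.DataFrame({"is_female": is_female, "income": np.log1p(income)}),
}

# Use one fixed splitter so every candidate is scored on identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2").mean()
    for name, X in candidates.items()
}

best = max(scores, key=scores.get)
print(scores)
print("best formulation:", best)
```

With real data, the same loop can be extended with more candidates (for example binned income) and the model you actually intend to use.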

Finally, as for which library/function to use, I prefer pandas' get_dummies because it lets you keep the column-name information in your final feature matrix.
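A minimal sketch of that point, using the Sex and Income columns from the question. Passing drop_first=True is an assumption made here to match the n-1 recommendation; it is not required by get_dummies.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["m", "m", "f", "m", "f", "f", "m", "m"],
    "Income": [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535],
})

# Encode only the categorical column; drop_first=True keeps n-1 dummy columns.
encoded = pd.get_dummies(df, columns=["Sex"], drop_first=True)

# The dummy column is named after the source column and category.
print(encoded.columns.tolist())
```

The resulting column names (here Income and Sex_m) stay attached to the feature matrix, which makes it easier to interpret things like feature importances later.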
