Are RandomForestRegressor features handled as categories?

Question

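I'm using RandomForestRegressor (from the great scikit-learn library in Python) for my project. It gives me good results, but I think I can do better. When I pass features to the 'fit(..)' function, is it better to encode categorical features as binary features?

Example: instead of: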

===========
continent |
===========
     1    |
===========
     2    |
===========
     3    |
===========
     2    |
===========

make something like:

===========================
is_europe | is_asia   | ...
===========================
    1     |     0     |
===========================
    0     |     1     |
===========================

Since it works like a tree, maybe the second option is better, or would the first option work just as well? Thanks a lot!

Answer 1

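Binarizing categorical variables is highly recommended, and it is expected to outperform a model without the binarizer transform. If scikit-learn treats continent = [1, 2, 3, 2] as numeric values (a continuous [quantitative] variable rather than a categorical [qualitative] one), it imposes an artificial order constraint on that feature. For example, suppose continent=1 means is_europe, continent=2 means is_asia, and continent=3 means is_america. This implies that is_asia always lies between is_europe and is_america when the relationship between the continent feature and your response variable y is examined, which is not necessarily true and may reduce the model's effectiveness. In contrast, turning it into dummy variables has no such problem, and scikit-learn will treat each binary feature separately.

To binarize a categorical variable in scikit-learn, you can use LabelBinarizer.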
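To make the ordering constraint concrete, here is a minimal pure-Python sketch (not scikit-learn's actual splitting code) that lists every grouping a single threshold split of the form `x <= t` can produce on the integer codes. The helper name `threshold_partitions` is hypothetical and introduced only for illustration.

```python
# Integer codes and their meaning, as assumed in the answer above.
codes = {1: "is_europe", 2: "is_asia", 3: "is_america"}


def threshold_partitions(values):
    """Return every (left, right) grouping one `x <= t` split can produce."""
    ordered = sorted(set(values))
    partitions = []
    # A threshold placed between two consecutive codes sends all lower codes left.
    for i in range(1, len(ordered)):
        left = frozenset(ordered[:i])
        right = frozenset(ordered[i:])
        partitions.append((left, right))
    return partitions


partitions = threshold_partitions(codes)
for left, right in partitions:
    print(sorted(codes[c] for c in left), "|", sorted(codes[c] for c in right))

# Check whether any single split puts code 2 (is_asia) alone on one side.
asia_alone = any(left == {2} or right == {2} for left, right in partitions)
print("is_asia isolated by one split:", asia_alone)
```

With the integer encoding, one split can only separate {1} from {2, 3} or {1, 2} from {3}. A deeper tree can still isolate code 2 with two consecutive splits, whereas a dedicated is_asia column lets a single split do it.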

from sklearn.preprocessing import LabelBinarizer


# your data
# ===========================
continent = [1, 2, 3, 2]
continent_dict = {1:'is_europe', 2:'is_asia', 3:'is_america'}
print(continent_dict)

{1: 'is_europe', 2: 'is_asia', 3: 'is_america'}

# processing
# =============================
binarizer = LabelBinarizer()
# fit on the categorical feature
continent_dummy = binarizer.fit_transform(continent)
print(continent_dummy)

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
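To tie this back to the original question, the following sketch passes such a dummy matrix to RandomForestRegressor.fit. The target values in `y` and the extra numeric column `other_feature` are made up purely for illustration, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelBinarizer

continent = [1, 2, 3, 2]

# Hypothetical data: one extra numeric feature and a target, invented for this sketch.
other_feature = np.array([[0.5], [1.5], [2.5], [1.0]])
y = np.array([10.0, 20.0, 30.0, 20.0])

# One binary column per continent instead of a single integer-coded column.
continent_dummy = LabelBinarizer().fit_transform(continent)

# Stack the dummy columns next to the remaining features.
X = np.hstack([continent_dummy, other_feature])

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)
pred = model.predict(X)
print(X.shape, pred.shape)
```

Whether this beats the integer encoding on real data is best checked empirically, for example with cross-validation on your own dataset.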

If you are processing your data in pandas, its top-level function pandas.get_dummies can also help.
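A minimal sketch of the pandas route, assuming the same integer-to-continent mapping as above. The column names and the "is" prefix are choices made for this example, not something prescribed by the answer.

```python
import pandas as pd

df = pd.DataFrame({"continent": [1, 2, 3, 2]})

# Map the integer codes to readable labels first (mapping assumed for illustration).
continent_dict = {1: "europe", 2: "asia", 3: "america"}
labels = df["continent"].map(continent_dict)

# One 0/1 column per label; dtype=int keeps the output numeric across pandas versions.
dummies = pd.get_dummies(labels, prefix="is", dtype=int)
print(dummies)
```

The resulting columns are ordered alphabetically by label, so it is safer to select them by name than by position when building the feature matrix.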

python

tree

machine-learning

random-forest

scikit-learn