In R's randomForest package, do factors have to be explicitly labeled as factors?

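Or will the package realize that they are not continuous and treat them as factors? I know that for classification, the response being classified does need to be a factor. But what about the predictor features? I ran it on a few toy datasets and got slightly different results depending on whether the categorical features were numeric or factors, but the algorithm is random, so I don't know whether the difference in my results is meaningful.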

Thanks!

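Yes, there is a difference between the two. If you want a variable to be treated as a factor, you should declare it as one rather than leaving it as numeric.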

For categorical data (this is actually a very good answer on CrossValidated):

A split on a factor with N levels is actually a selection of one of the (2^N)−2 possible combinations. So, the algorithm will check all the possible combinations and choose the one that produces the better split
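
To make the count in the quote concrete, here is a minimal Python sketch that enumerates the candidate splits for a small factor. It only illustrates the counting argument and is not randomForest's internal code; the level names and the helper `candidate_left_subsets` are hypothetical.

```python
from itertools import combinations


def candidate_left_subsets(levels):
    """Yield every non-empty proper subset of levels (the levels sent to the left child)."""
    levels = list(levels)
    for size in range(1, len(levels)):
        for subset in combinations(levels, size):
            yield frozenset(subset)


# Hypothetical factor with N = 4 levels.
levels = ["red", "green", "blue", "yellow"]
subsets = list(candidate_left_subsets(levels))

# Non-empty proper subsets: 2^N - 2.
print(len(subsets))

# A subset and its complement describe the same partition,
# so the number of distinct two-way splits is half of that.
all_levels = frozenset(levels)
distinct = {frozenset({s, all_levels - s}) for s in subsets}
print(len(distinct))
```

For four levels this gives 14 subsets, which matches (2^N)−2, and 7 distinct partitions, since each partition is counted once from each side.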

For numerical data (see here):

Numerical predictors are sorted then for every value Gini impurity or entropy is calculated and a threshold is chosen which gives the best split.
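
The threshold search described in the quote can be sketched as follows. This is a simplified illustration in Python, assuming Gini impurity and midpoints between adjacent sorted values as candidate thresholds; it is not the package's actual implementation, and `gini` and `best_threshold` are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(x, y):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    pairs = sorted(zip(x, y))  # sort observations by predictor value
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = ((lo + hi) / 2, weighted)
    return best


# A four-category variable coded as the numbers 1..4 (toy data).
# Categories 1 and 4 belong to class "b", categories 2 and 3 to class "a".
x = [1, 2, 3, 4]
y = ["b", "a", "a", "b"]
threshold, impurity = best_threshold(x, y)
print(threshold, impurity)
```

On this toy data no single threshold separates the classes, so the best weighted impurity stays above zero. A split on the same variable treated as a factor could group categories {1, 4} against {2, 3} and separate them perfectly, which is one way the two encodings can lead to different trees.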

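So yes, it makes a difference whether you add it as a factor or as a numeric variable. How much of a difference depends on the actual data.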

statistics

r

factors

random-forest