Does setting a random state in sklearn's RandomForestClassifier bias your model?

I trained a random forest model using a consistent random_state value, and I got very good accuracy on the training, test, and validation datasets (all around ~0.98), even though the minority class makes up only ~10% of the dataset.

If you're interested, here is the code:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=310, n_estimators=300)
model.fit(subset, train.iloc[:, -1])  # subset: training features; last column of train: labels

Given the high accuracy scores on the training, validation, and test datasets, could the random_state affect my model's ability to generalize?

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

In general, random_state is used to initialize the internal random parameters, so that you can repeat training deterministically. You can then change other hyperparameters (for example, the number of trees) and compare the results.
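A minimal sketch of this reproducibility, using a synthetic dataset (the data here is an assumption; the question's own data is not available):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the question's dataset.
X, y = make_classification(n_samples=200, random_state=0)

# Two forests trained with the same random_state are identical,
# so any change in results comes from the hyperparameters you vary.
a = RandomForestClassifier(n_estimators=50, random_state=310).fit(X, y)
b = RandomForestClassifier(n_estimators=50, random_state=310).fit(X, y)
assert (a.predict(X) == b.predict(X)).all()
```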

The downside may be that you do not find the global optimum. But your results sound quite good, with an accuracy of 0.98.

random_state does not affect the model's generalization. In fact, it is good practice to keep random_state fixed at the same value while you tune hyperparameters (e.g. n_estimators, depth, etc.). This ensures that your performance comparisons are not affected by the random initial state.

Also, when you have an imbalanced dataset, accuracy is not a recommended metric for measuring model performance.

The area under the ROC or PR curve is among the best metrics you can use, though many others are available. See here
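A sketch of computing these metrics alongside accuracy on data with roughly the question's ~10% minority class (the dataset and hyperparameters here are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~90% / ~10% class split, as in the question.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # scores for the positive (minority) class

# Accuracy can look high simply because the majority class dominates;
# ROC AUC and PR AUC (average precision) are rank-based and more informative here.
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("ROC AUC :", roc_auc_score(y_te, proba))
print("PR AUC  :", average_precision_score(y_te, proba))
```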

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

random_state is used for the random selection of feature subsets (smaller than the full feature set) and of sub-samples; this parameter controls that randomness.
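One way to see that random_state drives this sampling: forests trained with different seeds draw different bootstrap samples and feature subsets, so their learned feature importances differ (a sketch on assumed toy data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Different seeds -> different bootstrap samples and per-split feature
# subsets, hence numerically different feature importances.
imp1 = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y).feature_importances_
imp2 = RandomForestClassifier(n_estimators=50, random_state=2).fit(X, y).feature_importances_
print(np.allclose(imp1, imp2))
```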