K-fold cross-validation gives different results every time
All of my models are initialized as follows:
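from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
import lightgbm as lgb
import xgboost as xgb

def intiailize_clf_models(self):
    # Every estimator gets the same fixed random_state
    model = RandomForestClassifier(random_state=42)
    self.clf_models.append(model)
    model = ExtraTreesClassifier(random_state=42)
    self.clf_models.append(model)
    model = MLPClassifier(random_state=42)
    self.clf_models.append(model)
    model = LogisticRegression(random_state=42)
    self.clf_models.append(model)
    model = xgb.XGBClassifier(random_state=42)
    self.clf_models.append(model)
    model = lgb.LGBMClassifier(random_state=42)
    self.clf_models.append(model)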
I then loop over the models and run k-fold cross-validation:
from sklearn.model_selection import cross_val_score

def kfold_cross_validation(self):
    clf_models = self.get_models()
    models = []
    self.results = {}
    for model in clf_models:
        self.current_model_name = model.__class__.__name__
        # cv=4 lets cross_val_score build the 4-fold splitter itself
        cross_validate = cross_val_score(model, self.xtrain, self.ytrain, cv=4)
        self.mean_cross_validation_score = cross_validate.mean()
        print("Kfold cross validation for", self.current_model_name)
        # Store the mean score under the model's class name
        self.results[self.current_model_name] = self.mean_cross_validation_score
        models.append(model)
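Every time I run this cross-validation I get different results, even though I set the random state on each model. I would like to know why the cross-validation results differ and what I can do about it.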
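This is because you have not set a random_state for your k-fold generator. When you pass an int for cv, as in

cross_validate = cross_val_score(model, self.xtrain, self.ytrain, cv=4)

cross_val_score builds the (Stratified)KFold splitter itself, with random_state left as None, so you have no control over its random state. Any randomness that is not pinned down can lead to differently fitted models on each run, and therefore to different scores. (Note that a splitter built from an int does not shuffle by default, so any remaining variation may also come from other unseeded sources, such as NumPy's global random state.)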
The relevant part of the source file:
cv: int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the default 5-fold cross validation,
    - int, to specify the number of folds in a `(Stratified)KFold`,
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.

    For int/None inputs, if the estimator is a classifier and ``y`` is
    either binary or multiclass, :class:`StratifiedKFold` is used. In all
    other cases, :class:`KFold` is used.
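The rule quoted above can be checked directly. The following is a minimal sketch, assuming scikit-learn's public `check_cv` helper reflects how an int `cv` is resolved; the toy targets `y_class` and `y_reg` are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Hypothetical toy targets, used only for illustration.
y_class = np.array([0, 1] * 10)      # binary labels
y_reg = np.linspace(0.0, 1.0, 20)    # continuous target

# Classifier with binary labels: the int is resolved to StratifiedKFold.
cv_class = check_cv(4, y_class, classifier=True)
# Continuous target: the same int is resolved to plain KFold.
cv_reg = check_cv(4, y_reg, classifier=False)

print(type(cv_class).__name__, cv_class.get_n_splits(), cv_class.shuffle)
print(type(cv_reg).__name__, cv_reg.get_n_splits(), cv_reg.shuffle)
```

In both cases the resulting splitter has `shuffle=False`, which is the default for these classes.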
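To fix this, you can pass your own cross-validation generator with a controlled random state, as described in the documentation above. For example: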
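# (code untested)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# random_state only takes effect when shuffle=True; recent scikit-learn
# versions raise a ValueError if it is set while shuffle is False
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
cross_validate = cross_val_score(model, self.xtrain, self.ytrain, cv=skf)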
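One way to confirm that the scores are now reproducible is to run the same cross-validation twice and compare. This is only a sketch: the synthetic data from `make_classification` stands in for `self.xtrain` / `self.ytrain`, and `run_once` is a hypothetical helper, not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data standing in for self.xtrain / self.ytrain.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def run_once():
    # Both the model and the splitter get a fixed random_state.
    model = RandomForestClassifier(n_estimators=20, random_state=42)
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    return cross_val_score(model, X, y, cv=skf)

scores_a = run_once()
scores_b = run_once()
print(scores_a.mean(), scores_b.mean())
```

If the two means still differ for one of your models, the remaining randomness is coming from somewhere other than the splitter.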
I found the answer to my problem.
Setting the random seed as follows solved it:
import numpy as np
np.random.seed(22)  # returns None, so there is nothing to assign
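A likely reason this works is that scikit-learn objects whose random_state is None draw from NumPy's global random number generator, which `np.random.seed` resets. The sketch below illustrates that with a shuffled `KFold`; the toy array `X` and the helper `shuffled_test_folds` are assumptions for illustration, and which component was unseeded in the original code is not known.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # hypothetical toy data

def shuffled_test_folds():
    # random_state is left as None, so the splitter uses NumPy's global RNG.
    kf = KFold(n_splits=5, shuffle=True)
    return [test.tolist() for _, test in kf.split(X)]

np.random.seed(22)
first = shuffled_test_folds()
np.random.seed(22)   # reset the global RNG to the same state
second = shuffled_test_folds()
print(first == second)
```

Seeding the global generator affects everything that uses it, so passing an explicit random_state to each estimator and splitter is usually the more targeted option.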