cross_val_score default scoring not consistent?
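According to the docs, for the scoring parameter of cross_val_score:
If None, the estimator's default scorer (if available) is used.
For DecisionTreeRegressor, the default criterion is mse. So why am I getting different results here?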
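from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# X, y: the feature matrix and target (not shown in the question)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26)
# negate the result, because 'neg_mean_squared_error' returns -MSE
-cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error')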
>>> array([ 46.94808341, 18.78121305, 18.19914701, 18.06935431,
17.19546733, 28.91247609, 39.41410887, 21.30453162,
31.96443414, 23.74191199])
cross_val_score(dt, X_train, y_train, cv=10)
>>> array([ 0.35723619, 0.75254466, 0.7181376 , 0.65718608, 0.72531937,
0.4752839 , 0.43169728, 0.63916363, 0.41406146, 0.68977882])
If I had to guess, the default scoring appears to be R2 rather than mse. Is my understanding of the default scorer correct, or is this a bug?
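The default scorer for DecisionTreeRegressor is the R2 score, not mse. The criterion parameter (mse) only controls how splits are chosen while the tree is fitted; it has nothing to do with the score method that cross_val_score falls back on. You can find this in the docs for DecisionTreeRegressor: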
score(self, X, y, sample_weight=None)[source]
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
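To make the quoted definition concrete, here is a minimal sketch that computes 1 - u/v by hand and compares it with the estimator's score method and with r2_score. The diabetes dataset and random_state=0 are assumptions chosen only so the example runs on its own; they are not from the question.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Assumed example data; the question's own X, y are not shown.
X, y = load_diabetes(return_X_y=True)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=0)
dt.fit(X, y)
y_pred = dt.predict(X)

u = ((y - y_pred) ** 2).sum()    # residual sum of squares
v = ((y - y.mean()) ** 2).sum()  # total sum of squares
manual_r2 = 1 - u / v

# All three numbers should agree up to floating-point error.
print(manual_r2, dt.score(X, y), r2_score(y, y_pred))
```

If the three values match, that is consistent with score (and therefore the default scorer) being R2 rather than mse.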
@PV8 is right, but I'd like to point out two details.
Detail #1: How do you use the r2-score as the scoring metric? Answer: make_scorer
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, make_scorer
from sklearn.tree import DecisionTreeRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26)
print(cross_val_score(dt, X_train, y_train, cv=10, scoring=make_scorer(r2_score)))
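If you run this program several times, you will still get different results, because train_test_split shuffles the data randomly on each run.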
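As a side note on detail #1, the sketch below checks that three ways of asking for R2 give the same fold scores: leaving scoring unset, passing the built-in string "r2", and passing make_scorer(r2_score). It cross-validates on the full diabetes data and fixes random_state=0 on the tree; both are assumptions made here so that the three calls see identical inputs.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumed example data and seed, so every call sees the same folds and tree.
X, y = load_diabetes(return_X_y=True)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26, random_state=0)

default_scores = cross_val_score(dt, X, y, cv=10)                # scoring=None
named_scores = cross_val_score(dt, X, y, cv=10, scoring="r2")    # built-in name
custom_scores = cross_val_score(dt, X, y, cv=10,
                                scoring=make_scorer(r2_score))   # explicit scorer

print(np.allclose(default_scores, named_scores))
print(np.allclose(default_scores, custom_scores))
```

In practice the string "r2" is the shortest spelling; make_scorer is mainly useful when wrapping a metric that has no built-in name.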
Detail #2: How do you get consistent results?
You need to set the random_state parameter to get constant results.
For example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, make_scorer
from sklearn.tree import DecisionTreeRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.26)
print(cross_val_score(dt, X_train, y_train, cv=10, scoring=make_scorer(r2_score)))
The results are now always the same.
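One way to verify the reproducibility claim is to run the whole pipeline twice with the same seed and compare the score arrays. This is only a sketch: run_once is a hypothetical helper introduced for illustration, and it also passes random_state to the tree itself, which goes slightly beyond the example above (scikit-learn permutes features at each split, so seeding the estimator guards against tie-breaking differences).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor


def run_once(seed):
    """Hypothetical helper: split, build the tree, and cross-validate with one seed."""
    X, y = load_diabetes(return_X_y=True)
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # Seeding the tree as well is an extra precaution, not part of the answer.
    dt = DecisionTreeRegressor(
        max_depth=4, min_samples_leaf=0.26, random_state=seed
    )
    return cross_val_score(dt, X_train, y_train, cv=10)


first = run_once(42)
second = run_once(42)

# Identical seeds should give identical fold scores.
print(np.array_equal(first, second))
```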