sklearn cross_val_score() returns NaN values when I use "r2" as scoring
I am trying to use sklearn cross_val_score(). Here is an example of what I tried:
# loocv evaluate random forest on the housing dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestRegressor(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
The code above works fine without any issues. However, when I change scoring to r2, all the values in scores become nan.
The problem is combining LeaveOneOut() with r2 as the scoring function. LeaveOneOut() will split the data in such a way that only one sample is used for testing and the remaining samples are used for training. And here comes the problem: when you compute r2 on the validation set with this formula

R^2 = 1 - Σ(y_i - ŷ_i)^2 / Σ(y_i - ȳ)^2

the denominator becomes zero, because n = 1 (there is only one sample to validate against), so ȳ = y_i: the mean equals the single number you have. That division by zero produces the nan you observe. This is bound to happen whenever cv equals the number of data points, as below:
# evaluate model
scores = cross_val_score(model, X[0:10], y[0:10], scoring='r2', cv=10, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('R2: %.3f (%.3f)' % (mean(scores), std(scores)))
R2: nan (nan)
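The zero denominator is easy to verify with plain numpy: for a single held-out sample, the total sum of squares is exactly zero, because the sample equals its own mean. A minimal sketch (the value 21.5 is an arbitrary made-up target, not from the housing data):

```python
import numpy as np

# One held-out target value, as LeaveOneOut produces per fold
y_true = np.array([21.5])

# The mean of a one-element set equals the element itself ...
y_bar = y_true.mean()

# ... so the R^2 denominator (total sum of squares) is exactly zero
ss_tot = np.sum((y_true - y_bar) ** 2)
print(ss_tot)   # 0.0
```

With ss_tot equal to zero, the R² ratio is undefined, which is why sklearn reports nan for every LeaveOneOut fold.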
Now, when I set cv to some other value, it works fine:
# evaluate model
scores = cross_val_score(model, X[0:10], y[0:10], scoring='r2', cv=3, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('R2: %.3f (%.3f)' % (mean(scores), std(scores)))
R2: 0.662 (0.229)
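If you still want a single R² figure under leave-one-out validation, one common workaround is to pool the per-fold predictions with cross_val_predict and score them once over the whole set, so the denominator is computed from all targets rather than from one. This is a sketch, not the only remedy; it uses a small synthetic dataset from make_regression so it runs standalone:

```python
# Sketch: compute one R^2 over all leave-one-out predictions instead of
# averaging per-fold scores (each fold has only one sample, so per-fold
# R^2 is undefined).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=20, n_features=4, noise=0.1, random_state=1)
model = RandomForestRegressor(n_estimators=10, random_state=1)

# One prediction per sample, each made by a model trained on the other n-1 samples
y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut(), n_jobs=-1)

# A single R^2 over the pooled predictions is well-defined
print('R2: %.3f' % r2_score(y, y_pred))
```

The pooled score is well-defined because the denominator now sums over all n targets, whose mean generally differs from any individual y_i.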