Using GridSearchCV for RandomForestRegressor
I'm trying to use GridSearchCV for a RandomForestRegressor, but I keep getting ValueError: Found array with dim 100. Expected 500. Consider this toy example:
import numpy as np
from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import r2_score

if __name__ == '__main__':
    X = np.random.rand(1000, 2)
    y = np.random.rand(1000)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=1)

    # Set the parameters by cross-validation
    tuned_parameters = {'n_estimators': [500, 700, 1000], 'max_depth': [None, 1, 2, 3], 'min_samples_split': [1, 2, 3]}

    # clf = ensemble.RandomForestRegressor(n_estimators=500, n_jobs=1, verbose=1)
    clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5, scoring=r2_score, n_jobs=-1, verbose=1)
    clf.fit(X_train, y_train)
    print clf.best_estimator_
Here's what I get:
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Traceback (most recent call last):
  File "C:\Users\abudis\Dropbox\machine_learning\toy_example.py", line 21, in <module>
    clf.fit(X_train, y_train)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.py", line 1240, in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.py", line 1296, in _score
    score = scorer(estimator, X_test, y_test)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\metrics.py", line 2324, in r2_score
    y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\metrics.py", line 65, in _check_reg_targets
    y_true, y_pred = check_arrays(y_true, y_pred)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\validation.py", line 254, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 100. Expected 500
For some reason GridSearchCV seems to think that the n_estimators parameter should equal the size of each fold. If I change the first n_estimators value in the tuned_parameters dict, I get the same ValueError with a different expected value.
Training just one model with clf = ensemble.RandomForestRegressor(n_estimators=500, n_jobs=1, verbose=1) works fine, though, so I'm not sure whether I'm doing something wrong or there's a bug somewhere in scikit-learn.
Looks like a bug, but in your case it should work if you rely on RandomForestRegressor's own scorer (which happens to be the R^2 score) by not specifying any scoring function in GridSearchCV:
clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5,
                   n_jobs=-1, verbose=1)
Edit: As @jnothman mentioned in #4081, this is the real problem:
scoring does not accept a metric function. It accepts a function of signature (estimator, X, y_true=None) -> float score. You can use scoring='r2' or scoring=make_scorer(r2_score).
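Concretely, here is a minimal sketch of both suggested fixes. It is written against the old import paths used in the question (sklearn.grid_search; in current scikit-learn the same class lives in sklearn.model_selection), and the parameter grid is trimmed down purely for illustration:
import numpy as np
from sklearn import ensemble
from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer versions
from sklearn.metrics import r2_score, make_scorer

X = np.random.rand(1000, 2)
y = np.random.rand(1000)

# Smaller, illustrative grid (not the full grid from the question)
tuned_parameters = {'n_estimators': [500, 700], 'max_depth': [None, 2, 3]}

# Option 1: refer to the built-in R^2 scorer by its string alias
clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5,
                   scoring='r2', n_jobs=-1, verbose=1)

# Option 2: wrap the metric so it gets the (estimator, X, y) scorer signature
clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5,
                   scoring=make_scorer(r2_score), n_jobs=-1, verbose=1)

clf.fit(X, y)
print(clf.best_estimator_)
Either variant hands GridSearchCV a proper scorer object rather than a bare metric function, which is exactly what the quoted comment describes.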