How to fix "IndexError: tuple index out of range" in python?
I am using the sklearn module to find the best-fitting model and model parameters. However, I get an unexpected IndexError, shown below:
> IndexError Traceback (most recent call
> last) <ipython-input-38-ea3f99e30226> in <module>
> 22 s = mean_squared_error(y[ts], best_m.predict(X[ts]))
> 23 cv[i].append(s)
> ---> 24 print(np.mean(cv, 1))
> IndexError: tuple index out of range
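What I want to do is find the best-fitting regressor and its parameters, but I hit the error above. I searched SO and tried this solution, but I still get the same error. Any ideas on how to fix it? Can anyone point out why this error happens?
My code: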
import warnings

import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.datasets import make_regression

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]
X, y = make_regression(n_samples=10000, n_features=20)

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    # one list of per-fold scores for each model
    cv = [[] for _ in range(len(models))]
    fold = KFold(5, shuffle=False)
    for tr, ts in fold.split(X):
        for i, (model, param) in enumerate(zip(models, params)):
            best_m = GridSearchCV(model, param)
            best_m.fit(X[tr], y[tr])
            s = mean_squared_error(y[ts], best_m.predict(X[ts]))
            cv[i].append(s)
    # mean score per model across folds
    print(np.mean(cv, 1))
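Expected output:
If there is a way to fix the error above, I would like to select the best-fitting model together with its parameters and then use it for estimation. Any ideas for improving the attempt above? Thanks.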
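When you define
cv = [[] for _ in range(len(models))]
you get one empty list per model.
In the loop, however, you iterate over enumerate(zip(models, params)), which has only two elements because your params list has two elements (list(zip(x, y)) has length min(len(x), len(y))).
As a result, you get an IndexError when computing the mean with np.mean, because some of the lists in cv are empty (all but the first two), so cv is not a rectangular 2-D structure and there is no axis 1 to average over.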
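To make the truncation concrete, here is a minimal sketch that reproduces the shape of the problem without scikit-learn. The strings in models are stand-ins for the estimator objects and the appended 0.0 is a placeholder score, both assumptions made only for illustration.

```python
# Stand-in names only; no scikit-learn is needed to see the truncation.
models = ["SVR", "RandomForestRegressor", "LinearRegression",
          "Ridge", "Lasso", "XGBRegressor"]
params = [{"C": [0.01, 1]}, {"n_estimators": [10, 20]}]

cv = [[] for _ in range(len(models))]  # six empty lists, one per model

pairs = list(zip(models, params))  # zip stops at the shorter input
for i, (name, grid) in enumerate(pairs):
    cv[i].append(0.0)  # placeholder score; only indices 0 and 1 are reached

row_lengths = [len(row) for row in cv]
print(len(pairs))    # 2
print(row_lengths)   # [1, 1, 0, 0, 0, 0]
```

The last four lists never receive a score, which is what leaves cv ragged.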
Solution:
If you don't need GridSearchCV for the remaining models, you can extend the params list with empty dictionaries:
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]
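If you would rather not count the empty dictionaries by hand, one possible approach is to pad params programmatically. This is only a sketch: the strings in models stand in for the estimator objects, and the name padded_params is made up for this example.

```python
# Stand-in names for the six estimators.
models = ["SVR", "RandomForestRegressor", "LinearRegression",
          "Ridge", "Lasso", "XGBRegressor"]
params = [{"C": [0.01, 1]}, {"n_estimators": [10, 20]}]

# Add one fresh empty grid for every model that has none.
missing = len(models) - len(params)
padded_params = params + [{} for _ in range(missing)]

# Fail early if the two lists still disagree in length.
if len(padded_params) != len(models):
    raise ValueError("models and params must have the same length")

pairs = list(zip(models, padded_params))
print(len(pairs))  # 6
```

On Python 3.10 or later, zip(models, params, strict=True) is another way to catch the mismatch, since it raises ValueError when the inputs differ in length.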
The root cause of the problem is that, although you ask GridSearchCV to evaluate 6 models, you supply parameters only for the first two:
models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]
With this setup, the result of enumerate(zip(models, params)), i.e.:
for i, (model, param) in enumerate(zip(models, params)):
    print((model, param))
is
(SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), {'C': [0.01, 1]})
(RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False), {'n_estimators': [10, 20]})
i.e. the last 4 models are simply ignored, so you get empty entries for them in cv:
print(cv)
# result:
[[5950.6018771284835, 5987.293514740653, 6055.368320208183, 6099.316091619069, 6146.478702335218], [3625.3243553665975, 3301.3552182952058, 3404.3321983193728, 3521.5160621260898, 3561.254684271113], [], [], [], []]
which leads to the downstream error when trying to get np.mean(cv, 1).
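A quick way to diagnose this kind of problem before calling np.mean is to check whether every row of cv has the same non-zero length. The sketch below uses only the standard library; the scores are placeholder numbers, not the values printed above, and the names is_rectangular, empty_rows and row_means are made up for this example.

```python
from statistics import mean

# Placeholder scores: two models scored on two folds, four never scored.
cv = [[1.0, 3.0], [2.0, 4.0], [], [], [], []]

lengths = {len(row) for row in cv}
# Rectangular means every row has the same length and none is empty.
is_rectangular = len(lengths) == 1 and 0 not in lengths

# Indices of the models that were silently skipped.
empty_rows = [i for i, row in enumerate(cv) if not row]

# Per-row means that tolerate the gaps instead of failing.
row_means = [mean(row) if row else None for row in cv]

print(is_rectangular)  # False
print(empty_rows)      # [2, 3, 4, 5]
print(row_means)       # [2.0, 3.0, None, None, None, None]
```

The list of empty rows points straight at the models that never went through the loop.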
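As Psi has already correctly pointed out in their answer, the solution is to use empty dictionaries for the models on which you do not actually perform any CV search; omitting XGBRegressor (I don't have it installed), the results are as follows: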
models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso()]
params2 = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}]
cv = [[] for _ in range(len(models))]
fold = KFold(5,shuffle=False)
for tr, ts in fold.split(X):
    for i, (model, param) in enumerate(zip(models, params2)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)
where print(cv) gives:
[[4048.660483326826, 3973.984055352062, 3847.7215568088545, 3907.0566348092684, 3820.0517432992765], [1037.9378737329769, 1025.237441119364, 1016.549294695313, 993.7083268195154, 963.8115632611381], [2.2948917095935095e-26, 1.971022007799432e-26, 4.1583774042712844e-26, 2.0229469068846665e-25, 1.9295075684919642e-26], [0.0003350178681602639, 0.0003297411022124562, 0.00030834076832371557, 0.0003355298330301431, 0.00032049282437794516], [10.372789356303688, 10.137748082073076, 10.136028304131141, 10.499159069700834, 9.80779910439471]]
and print(np.mean(cv, 1)) works fine, giving:
[3.91949489e+03 1.00744890e+03 6.11665355e-26 3.25824479e-04
1.01907048e+01]
So, in your case, you should indeed change params to:
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]
as Psi has already suggested.
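Since the question also asks how to pick the best model, here is a minimal sketch of that last step. It takes the mean MSE values printed above for the five-model run and selects the smallest one; the names model_names, mean_mse, best_index and best_name are made up for this example.

```python
# Model order matches the five-model list used in the run above.
model_names = ["SVR", "RandomForestRegressor", "LinearRegression",
               "Ridge", "Lasso"]

# Mean MSE per model, copied from the np.mean(cv, 1) output above.
mean_mse = [3.91949489e+03, 1.00744890e+03, 6.11665355e-26,
            3.25824479e-04, 1.01907048e+01]

# Lower MSE is better, so pick the index of the smallest mean.
best_index = min(range(len(mean_mse)), key=mean_mse.__getitem__)
best_name = model_names[best_index]

print(best_index)  # 2
print(best_name)   # LinearRegression
```

The near-zero error for LinearRegression is not surprising here, because make_regression generates a linear target and adds no noise by default. For the models that do have a parameter grid, the fitted GridSearchCV object exposes the selected settings as best_m.best_params_.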