LightGBM Regression CV: Interpreting Results
I've looked through the documentation but can't find the answer to my question, and I'm hoping someone here knows. Here is some example code:
N_FOLDS = 5
model = lgb.LGBMClassifier()
default_params = model.get_params()
# overwriting a param
default_params['objective'] = 'regression'
cv_results = lgb.cv(default_params, train_set, num_boost_round=100000, nfold=N_FOLDS,
                    early_stopping_rounds=100, metrics='rmse', seed=50, stratified=False)
I get back a dictionary like this, with 6 values under each key:
{'rmse-mean': [635.2078190031074,
632.0847253839236,
629.6661071275558,
627.9721515847672,
626.6712284533291,
625.293530527769],
'rmse-stdv': [197.5088741303537,
198.66960690389863,
199.56134068525006,
200.25929541235243,
200.8251430042537,
201.50213772830526]}
At first I thought the values in the dictionary corresponded to the RMSE of each fold (5 in this case), but that isn't the case. The 'rmse-mean' list looks like it's sorted in decreasing order.
Does anyone know what each value corresponds to?
The values don't correspond to the folds; they correspond to the CV result at each boosting round (the mean of the RMSE over all test folds). You can see this very clearly if we run only 5 rounds and print the result for every round:
import lightgbm as lgb
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version

X, y = load_boston(return_X_y=True)
train_set = lgb.Dataset(X, label=y)
N_FOLDS = 5
params = {'learning_rate': 0.05, 'num_leaves': 4, 'subsample': 0.5}
cv_results = lgb.cv(params, train_set, num_boost_round=5, nfold=N_FOLDS, verbose_eval=True,
                    early_stopping_rounds=None, metrics='rmse', seed=50, stratified=False)
[LightGBM] [Info] Total Bins 1251
[LightGBM] [Info] Number of data points in the train set: 404, number of used features: 13
[LightGBM] [Info] Start training from score 22.585149
[LightGBM] [Info] Start training from score 22.109406
[LightGBM] [Info] Start training from score 22.579703
[LightGBM] [Info] Start training from score 22.784158
[LightGBM] [Info] Start training from score 22.599010
[1] cv_agg's rmse: 8.86903 + 0.88135
[2] cv_agg's rmse: 8.58355 + 0.860252
[3] cv_agg's rmse: 8.31477 + 0.842578
[4] cv_agg's rmse: 8.06201 + 0.82627
[5] cv_agg's rmse: 7.8268 + 0.800053
import pandas as pd
pd.DataFrame(cv_results)
rmse-mean rmse-stdv
0 8.869030 0.881350
1 8.583552 0.860252
2 8.314774 0.842578
3 8.062014 0.826270
4 7.826800 0.800053
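The `cv_agg` line format (`mean + stdv`) can be reproduced by hand: at each round, the per-fold RMSEs are aggregated into their mean and, as far as I can tell, their population standard deviation. A minimal sketch with made-up per-fold values (not taken from the run above):

```python
import statistics

# Hypothetical per-fold RMSEs at a single boosting round
# (made-up numbers, purely for illustration):
fold_rmses = [8.1, 9.5, 8.7, 9.9, 8.2]

# The per-round aggregation is just the mean and population standard
# deviation across folds -- these become "rmse-mean" and "rmse-stdv".
rmse_mean = sum(fold_rmses) / len(fold_rmses)
rmse_stdv = statistics.pstdev(fold_rmses)
print(f"cv_agg's rmse: {rmse_mean:g} + {rmse_stdv:g}")
```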
In your post you set early_stopping_rounds = 100 and used the default learning_rate = 0.1, which may be a bit high depending on your data, so it most likely stopped after 6 rounds. Using the same example as above, if we set early_stopping_rounds = 100, training stops once the metric has not improved for 100 consecutive rounds, and the returned results are truncated at the best iteration (100 rounds before the stop):
cv_results = lgb.cv(params, train_set, num_boost_round=2000, nfold=N_FOLDS,
                    verbose_eval=True, early_stopping_rounds=100, metrics='rmse',
                    seed=50, stratified=False)
[...]
[1475] cv_agg's rmse: 3.20605 + 0.50213
[1476] cv_agg's rmse: 3.20616 + 0.501997
[1477] cv_agg's rmse: 3.20607 + 0.501998
[1478] cv_agg's rmse: 3.20636 + 0.501865
[1479] cv_agg's rmse: 3.20631 + 0.501905
[1480] cv_agg's rmse: 3.20633 + 0.501731
[1481] cv_agg's rmse: 3.20659 + 0.501494
[1482] cv_agg's rmse: 3.2068 + 0.502046
[1483] cv_agg's rmse: 3.20687 + 0.50213
[1484] cv_agg's rmse: 3.20701 + 0.502265
[1485] cv_agg's rmse: 3.20717 + 0.502096
[1486] cv_agg's rmse: 3.2072 + 0.501779
[1487] cv_agg's rmse: 3.20722 + 0.501613
[1488] cv_agg's rmse: 3.20718 + 0.501308
[1489] cv_agg's rmse: 3.20701 + 0.501232
pd.DataFrame(cv_results).shape
(1389, 2)
If you want an estimate of the model's RMSE, take the last value.
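Putting that together, here is a minimal sketch of pulling the RMSE estimate and the optimal number of boosting rounds out of the returned dict. The `cv_results` literal below stands in for the output of `lgb.cv`, filled with rounded values from the question purely for illustration:

```python
# Stand-in for the dict returned by lgb.cv (rounded values from the question):
cv_results = {
    'rmse-mean': [635.21, 632.08, 629.67, 627.97, 626.67, 625.29],
    'rmse-stdv': [197.51, 198.67, 199.56, 200.26, 200.83, 201.50],
}

# Each index is one boosting round. With early stopping, the lists are
# truncated at the best iteration, so their length gives the optimal
# num_boost_round and the last entry is the cross-validated RMSE estimate.
best_num_boost_round = len(cv_results['rmse-mean'])
rmse_estimate = cv_results['rmse-mean'][-1]
print(best_num_boost_round, rmse_estimate)  # 6 625.29
```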