Early stopping for lightgbm not working when RMSLE is the eval metric
I am trying to train a lightgbm ML model in Python using rmsle as the eval metric, but I run into a problem when I try to include early stopping.
Here is my code:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
df_train = pd.read_csv('train_data.csv')
X_train = df_train.drop('target', axis=1)
y_train = np.log(df_train['target'])
sample_params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'random_state': 42,
    'metric': 'rmsle',
    'lambda_l1': 5,
    'lambda_l2': 5,
    'num_leaves': 5,
    'bagging_freq': 5,
    'max_depth': 5,
    'max_bin': 5,
    'min_child_samples': 5,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'learning_rate': 0.1,
}
X_train_tr, X_train_val, y_train_tr, y_train_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
def train_lightgbm(X_train_tr, y_train_tr, X_train_val, y_train_val, params, num_boost_round, early_stopping_rounds, verbose_eval):
    d_train = lgb.Dataset(X_train_tr, y_train_tr)
    d_val = lgb.Dataset(X_train_val, y_train_val)
    model = lgb.train(
        params=params,
        train_set=d_train,
        num_boost_round=num_boost_round,
        valid_sets=d_val,
        early_stopping_rounds=early_stopping_rounds,
        verbose_eval=verbose_eval,
    )
    return model
model = train_lightgbm(
    X_train_tr,
    y_train_tr,
    X_train_val,
    y_train_val,
    params=sample_params,
    num_boost_round=500,
    early_stopping_rounds=True,
    verbose_eval=1
)
df_test = pd.read_csv('test_data.csv')
X_test = df_test.drop('target', axis=1)
y_test = np.log(df_test['target'])
df_train['prediction'] = np.exp(model.predict(X_train))
df_test['prediction'] = np.exp(model.predict(X_test))
def rmsle(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    return np.sqrt(np.mean(np.power(np.log1p(y_true + 1) - np.log1p(y_pred + 1), 2)))
metric = rmsle(y_test, df_test['prediction'])
print('Test Metric Value:', round(metric, 4))
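If I set early_stopping_rounds=False in the call to train_lightgbm, the code runs without problems.
However, if I set early_stopping_rounds=True, it throws the following:
ValueError: For early stopping, at least one dataset and eval metric is required for evaluation
If I run a similar script but with 'metric': 'rmse' instead of 'rmsle' in sample_params, it runs even with early_stopping_rounds=True.
What do I need to add for lightgbm to recognize my dataset and eval metric? Thanks!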
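By default, LightGBM does not support rmsle as a metric (see the list of available metrics in the LightGBM documentation).
To use it, you have to define a custom evaluation function: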
def rmsle_lgbm(y_pred, data):
    y_true = np.array(data.get_label())
    score = np.sqrt(np.mean(np.power(np.log1p(y_true) - np.log1p(y_pred), 2)))
    return 'rmsle', score, False
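The function returns a tuple of (metric name, value, whether higher is better). To sanity-check it without training anything, here is a small sketch that calls it on known values. LabelHolder is a hypothetical stand-in for lgb.Dataset that exposes only get_label(); it is not part of LightGBM.

```python
import numpy as np

def rmsle_lgbm(y_pred, data):
    """Custom eval function: returns (name, value, is_higher_better)."""
    y_true = np.array(data.get_label())
    score = np.sqrt(np.mean(np.power(np.log1p(y_true) - np.log1p(y_pred), 2)))
    return 'rmsle', score, False

class LabelHolder:
    """Hypothetical stand-in for lgb.Dataset; only get_label() is provided."""
    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label

y_true = np.array([1.0, 3.0, 7.0])
holder = LabelHolder(y_true)

# Perfect predictions give an RMSLE of zero.
perfect = rmsle_lgbm(y_true.copy(), holder)
print(perfect)

# Predictions chosen so that log1p(pred) - log1p(true) == 1 for every row,
# which makes the RMSLE (approximately) 1.
shifted_pred = np.e * (1.0 + y_true) - 1.0
shifted = rmsle_lgbm(shifted_pred, holder)
print(shifted)
```

Note that the metric only makes sense for labels and predictions greater than -1, since np.log1p is undefined below that.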
Redefine your parameter dictionary like this:
params = {
    ....
    'objective': 'regression',
    'metric': 'custom',  # <=============
    ....
}
Then train the model, passing the function through feval:
model = lgb.train(
    params=params,
    train_set=d_train,
    num_boost_round=num_boost_round,
    valid_sets=d_val,
    early_stopping_rounds=early_stopping_rounds,
    verbose_eval=verbose_eval,
    feval=rmsle_lgbm  # <=============
)
PS: np.log(y + 1) = np.log1p(y), so np.log1p(y + 1) in your rmsle function looks like a mistake.
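A quick numerical check of that identity, as a sketch with made-up values. It also illustrates a related point: RMSLE on raw values equals plain RMSE computed on log1p-transformed values, which is why a model trained on np.log1p(target) can be monitored with the built-in 'rmse' metric. The helper names rmsle and rmse are just for this example.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSLE using log1p once, i.e. log(1 + y)."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

y = np.array([0.0, 1.0, 9.0, 99.0])
p = np.array([0.5, 2.0, 7.0, 120.0])

# log1p(y) is log(y + 1) ...
same_as_log_plus_one = np.allclose(np.log1p(y), np.log(y + 1))
# ... so log1p(y + 1) is actually log(y + 2).
is_log_plus_two = np.allclose(np.log1p(y + 1), np.log(y + 2))

# RMSLE on raw values == RMSE on log1p-transformed values.
matches_rmse_on_logs = np.isclose(rmsle(y, p), rmse(np.log1p(y), np.log1p(p)))

print(same_as_log_plus_one, is_log_plus_two, matches_rmse_on_logs)
```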