Why does LightGBM with 'objective': 'binary' not return the binary values 0 and 1 when calling predict()?
I created a binary classification model with LightGBM:
import lightgbm as lgb

# Dataset
y_train = data_train['Label']
X_train = data_train.drop(['Label'], axis=1)
y_test = data_test['Label']
X_test = data_test.drop(['Label'], axis=1)
train_data = lgb.Dataset(data=X_train, label=y_train)
test_data = lgb.Dataset(data=X_test, label=y_test)

# Setting default parameters
params_wo_constraints = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': {'binary_logloss', 'auc'},
    'num_leaves': 32,
    'max_depth': 5,  # key had a trailing space ('max_depth '), which LightGBM would not recognize
    'min_data_in_leaf': 100,
    'seed': 42,
    'bagging_seed': 42,
    'feature_fraction_seed': 42,
    'drop_seed': 42,
    'data_random_seed': 42
}

# Model training
evals_result = {}
model_wo_constraints = lgb.train(
    params=params_wo_constraints,
    train_set=train_data,
)

# Prediction
train_preds_wo_constraints = model_wo_constraints.predict(X_train)
test_preds_wo_constraints = model_wo_constraints.predict(X_test)
But the values of train_preds_wo_constraints are not 0 and 1:
>>> array([7.02862608e-02, 7.02498237e-01, 4.85224849e-01, ...,
4.00079287e-04, 1.76385121e-01, 2.09733409e-01])
I tried the sklearn API and it works as expected:
model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=5, random_state=42)
model.fit(X_train, y_train, eval_set=[(X_test, y_test), (X_train, y_train)],
          verbose=20, eval_metric='logloss')
preds_wo_constraints = model.predict(X_train)
preds_wo_constraints
>>> array([0, 1, 1, ..., 0, 0, 0])
Can anyone explain why this happens and how to solve it?
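lightgbm.train() in the LightGBM Python package produces a lightgbm.Booster object.
For binary classification, lightgbm.Booster.predict() by default returns the predicted probability that the target equals 1.
Consider the following minimal, reproducible example using lightgbm==3.3.2 and Python 3.8.12.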
import lightgbm as lgb
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=1000,
    n_features=5,
    centers=2,
    random_state=708
)

params = {
    "objective": "binary",
    "min_data_in_leaf": 5,
    "min_data_in_bin": 5,
    "seed": 708
}

bst = lgb.train(
    params=params,
    train_set=lgb.Dataset(data=X, label=y),
    num_boost_round=5
)

preds = bst.predict(X)
preds[:10]
array([0.29794759, 0.70205241, 0.70205241, 0.70205241, 0.29794759,
0.29794759, 0.29794759, 0.29794759, 0.70205241, 0.29794759])
These are the probabilities that the target value is 1.
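Some context on "by default": Booster.predict() also accepts raw_score=True, which returns the untransformed model output instead of probabilities. For the binary objective with the default sigmoid parameter, the probability is the logistic sigmoid of that raw score. The sketch below uses only NumPy and does not call LightGBM. It recovers what the raw scores would be for the two probability values shown above, and the helper names sigmoid and logit are illustrative, not LightGBM functions.

```python
import numpy as np

def sigmoid(raw):
    """Logistic function: maps raw scores (log-odds) to probabilities."""
    return 1.0 / (1.0 + np.exp(-raw))

def logit(p):
    """Inverse of sigmoid: maps probabilities back to log-odds."""
    return np.log(p / (1.0 - p))

# The two distinct probabilities from the output above.
probs = np.array([0.29794759, 0.70205241])

raw = logit(probs)        # log-odds corresponding to these probabilities
recovered = sigmoid(raw)  # applying the sigmoid gets the probabilities back

# A probability above 0.5 corresponds to a raw score above 0,
# so thresholding either quantity gives the same classes.
same_decision = (probs > 0.5) == (raw > 0)
print(raw)
print(same_decision)
```

Either way, the Booster hands back a continuous score and leaves the class decision to the caller.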
In the scikit-learn interface of the lightgbm Python package, training produces an instance of lightgbm.LGBMClassifier.
For binary classification, lightgbm.LGBMClassifier.predict() returns the predicted class.
clf = lgb.LGBMClassifier(**params)
clf.fit(X, y)
preds_sklearn = clf.predict(X)
preds_sklearn[:10]
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0])
explain why
scikit-learn requires classifiers to produce predicted classes from their predict() methods.
scikit-learn has very strict standards for writing custom estimators that are meant to be compatible with scikit-learn's functionality. These are described in "Developing scikit-learn estimators". The "Glossary of Common Terms and API Elements" linked from that guide says that the predict() method for scikit-learn estimators must produce predictions "in the same target space used in fitting", which for classification means "one of the values in the classifier’s classes_ attribute" (docs link).
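To make the glossary's wording concrete, here is a minimal NumPy-only sketch of that contract. It is not scikit-learn's or LightGBM's actual code: the function predict_from_proba, the string labels, and the probability values are made up for illustration. It assumes a classifier that stores the sorted unique training labels in classes_ and predicts the most probable class.

```python
import numpy as np

def predict_from_proba(proba, classes_):
    """Map per-class probabilities (one column per class) to class labels."""
    # Pick the most probable column, then translate the column index
    # back into the label space that was seen during fitting.
    return classes_[np.argmax(proba, axis=1)]

# Hypothetical training labels: the target space is {"ham", "spam"}, not {0, 1}.
y_train = np.array(["spam", "ham", "ham", "spam"])
classes_ = np.unique(y_train)  # sorted unique labels: ['ham', 'spam']

# Hypothetical probabilities of the positive class ("spam") for three rows.
p_positive = np.array([0.2979, 0.7021, 0.9100])
proba = np.column_stack([1.0 - p_positive, p_positive])

labels = predict_from_proba(proba, classes_)
print(labels)  # ['ham' 'spam' 'spam']
```

Every returned value is drawn from classes_, which is what "the same target space used in fitting" means for a classifier.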
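lightgbm.train() is a lower-level interface whose goal is to provide performant, flexible control over LightGBM. It produces a Booster, and Booster.predict() produces probabilities so that the user's code can decide what to do with them (for example, converting them to classes with a custom threshold, or using them as sample weights in some post-processing code).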
how to solve this problem?
To convert predicted binary classification probabilities into predicted classes, compare those probabilities to a threshold.
pred_class = (preds > 0.5).astype("int")
pred_class[:10]
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0])
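The 0.5 cutoff is only one choice. As a hedged sketch, the helper below wraps the same comparison so the threshold can be tuned. The name to_classes and the 0.4 threshold are assumptions for illustration, not part of LightGBM. The probabilities are the ones printed in the question, rounded to four decimals.

```python
import numpy as np

def to_classes(probs, threshold=0.5):
    """Return 1 where the predicted probability exceeds threshold, else 0."""
    probs = np.asarray(probs, dtype=float)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return (probs > threshold).astype(int)

# Probabilities like those Booster.predict() returned in the question.
probs = np.array([0.0703, 0.7025, 0.4852, 0.0004, 0.1764, 0.2097])

default_classes = to_classes(probs)       # 0.5 cutoff
lenient_classes = to_classes(probs, 0.4)  # lower cutoff flags more rows as 1

print(default_classes)  # [0 1 0 0 0 0]
print(lenient_classes)  # [0 1 1 0 0 0]
```

A lower threshold labels more rows as class 1, which usually raises recall at the cost of precision, so the right value depends on the application.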