
Question

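I am trying to fit a single decision tree using the Python module lightgbm. However, the output looks a bit strange. I have 15 explanatory variables, and the numeric response variable has the following characteristics: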

count    653.000000
mean      31.503813
std       11.838267
min       13.750000
25%       22.580000
50%       28.420000
75%       38.250000
max       76.750000
Name: X2, dtype: float64

I do the following to fit the tree. First, I construct the Dataset object:

df_train = lightgbm.Dataset(
    df, # The data 
    label = df[response], # The response series
    feature_name = features, # A list with names of all explanatory variables
    categorical_feature = categorical_vars # A list with names of the categorical ones
)

Next, I define the parameters and fit the model:

param = {
    # make it a single tree:
    'objective': 'regression',
    'bagging_freq': 0,  # Disable bagging
    'feature_fraction': 1,  # don't randomly select features. consider all.
    'num_trees': 1,

    # tuning parameters
    'max_leaves': 20,
    'max_depth': -1,
    'min_data_in_leaf': 20
}

model = lightgbm.train(param, df_train)

From the model I extract the leaves of the tree as:

tree = model.trees_to_dataframe()[[
    'right_child',
    'node_depth',
    'value',
    'count']]

leaves = tree[tree.right_child.isnull()]

print(leaves)

   right_child  node_depth      value  count
5         None           6  29.957982     20
6         None           6  30.138253     28
8         None           6  30.269373     34
9         None           6  30.404353     38
12        None           6  30.528705     33
13        None           6  30.651690     62
14        None           5  30.842856     59
17        None           5  31.080432     51
19        None           6  31.232860     21
20        None           6  31.358547     26
22        None           5  31.567571     43
23        None           5  31.795345     46
28        None           6  32.034321     27
29        None           6  32.247890     24
31        None           6  32.420886     22
32        None           6  32.594289     21
34        None           5  32.920932     20
35        None           5  33.210205     22
37        None           4  33.809376     36
38        None           4  34.887632     20

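Now, if you look at the values, they range from (approximately) 30 to 35. This is far from capturing the distribution of the response variable (min = 13.75 and max = 76.75, as shown above).

Can anyone explain to me what is going on here?

Follow-up based on the accepted answer:

I tried adding 'learning_rate': 1 and 'min_data_in_bin': 1 to the parameter dict, which resulted in the following tree: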

   right_child  node_depth      value  count
5         None           6  16.045500     20
6         None           6  17.824074     27
8         None           6  19.157500     36
9         None           6  20.529730     37
12        None           6  21.805834     36
13        None           6  23.048387     62
14        None           5  24.975263     57
17        None           5  27.335385     52
19        None           6  29.006800     25
20        None           6  30.234286     21
22        None           5  32.221591     44
23        None           5  34.472272     44
28        None           6  36.808889     27
29        None           6  38.944583     24
31        None           6  40.674546     22
32        None           6  42.408572     21
34        None           5  45.675000     20
35        None           5  48.567728     22
37        None           4  54.559445     36
38        None           4  65.341999     20

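This is much preferable. It means we can now use lightgbm to mimic the behavior of a single decision tree with categorical features. Unlike sklearn, lightgbm respects "true" categorical variables, whereas in sklearn one has to one-hot encode all categorical variables, which can turn out really badly; see this kaggle post.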
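For reference, the parameter dict from the question with the two additions from the follow-up might look like the sketch below. It only combines values already shown above; the commented-out training call assumes the `df_train` Dataset built earlier in the question.

```python
# Parameter dict from the question plus the two settings from the follow-up.
param = {
    # make it a single tree:
    'objective': 'regression',
    'bagging_freq': 0,       # disable bagging
    'feature_fraction': 1,   # consider all features at every split
    'num_trees': 1,

    # tuning parameters
    'max_leaves': 20,
    'max_depth': -1,
    'min_data_in_leaf': 20,

    # additions from the follow-up
    'learning_rate': 1,      # do not shrink the single tree's contribution
    'min_data_in_bin': 1,    # allow bins that hold a single sample
}

# Assumes `df_train` is the lightgbm.Dataset constructed in the question:
# model = lightgbm.train(param, df_train)
```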

Answer 1

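As you may know, LightGBM does a couple of tricks to speed things up. One of them is feature binning, where feature values are assigned to bins to reduce the number of possible splits. By default, the minimum number of samples per bin (min_data_in_bin) is 3, so if you have 100 samples, for example, you will have about 34 bins.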
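The "about 34 bins" figure is just 100 samples divided by 3 samples per bin, rounded up. The sketch below reproduces that arithmetic. The helper `rough_bin_upper_bound` is a hypothetical name and only a rough upper bound for one numeric feature; it is not LightGBM's actual bin-finding procedure. The cap of 255 is the default value of `max_bin`.

```python
import math

def rough_bin_upper_bound(n_samples, min_data_in_bin=3, max_bin=255):
    """Rough upper bound on the number of bins for one numeric feature.

    Hypothetical helper for illustration only; LightGBM's real bin
    construction is more involved than this.
    """
    return min(max_bin, math.ceil(n_samples / min_data_in_bin))

print(rough_bin_upper_bound(100))                     # 34, the default case from the answer
print(rough_bin_upper_bound(100, min_data_in_bin=1))  # 100, one bin per sample is possible
```

With min_data_in_bin set to 1, every distinct value can get its own bin, so the tree is no longer restricted to the coarser split points.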

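Another important thing here, when using a single tree, is that LightGBM does boosting by default, which means it starts from an initial score and tries to improve on it gradually. This gradual change is controlled by learning_rate, which defaults to 0.1, so each tree's predictions are multiplied by that number and added to the current score. With only one tree, the leaf values therefore stay close to the initial score, which is why they span only about 30 to 35 in the question.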
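The shrinkage can be checked against the numbers in the question. The sketch below assumes the initial score is the mean of the response (31.503813), which is consistent with the reported values, and takes the lowest and highest leaf values from the follow-up tree fitted with learning_rate = 1. The function name `shrunk_leaf` is made up for this illustration.

```python
# Values copied from the question.
init_score = 31.503813   # mean of the response, assumed to be the initial score
learning_rate = 0.1      # LightGBM default

def shrunk_leaf(full_leaf_value, init=init_score, lr=learning_rate):
    """Leaf value after moving only a fraction `lr` of the way from the initial score."""
    return init + lr * (full_leaf_value - init)

lowest = shrunk_leaf(16.045500)    # lowest leaf with learning_rate = 1
highest = shrunk_leaf(65.341999)   # highest leaf with learning_rate = 1

# Compare with 29.957982 and 34.887632 in the first table of the question.
print(lowest, highest)
```

Both results agree with the first table to within rounding, so the narrow range of 30 to 35 is the full tree's range shrunk towards the mean by a factor of 0.1.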

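The last thing to consider is that the size of the tree is controlled by num_leaves, which defaults to 31. If you want to grow the tree fully, you have to set this number to the number of samples.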

So if you want to replicate a fully grown decision tree in LightGBM, you have to adjust these parameters. Here is an example:

import lightgbm as lgb
import numpy as np
import pandas as pd

X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0]**2
ds = lgb.Dataset(X, y)
params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
bst = lgb.train(params, ds, num_boost_round=1)
print(pd.concat([
    bst.trees_to_dataframe().loc[lambda x: x['left_child'].isnull(), 'value'].describe().rename('leaves'),
    pd.Series(y).describe().rename('y'),
], axis=1))

         leaves         y
count       100       100
mean    2.33502   2.33502
std    0.882451  0.882451
min           1         1
25%     1.56252   1.56252
50%     2.25003   2.25003
75%     3.06252   3.06252
max           4         4

That said, if you are looking for a decision tree, it would be easier to use scikit-learn's:

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor().fit(X, y)
np.allclose(bst.predict(X), tree.predict(X))
# True

Python - decision tree in lightgbm with odd values
