Interaction between sample_weight and min_samples_split in decision trees
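In sklearn.ensemble.RandomForestClassifier, if we set both sample_weight and min_samples_split, do the sample weights affect min_samples_split? For example, if min_samples_split = 20 and every data point has a weight of 2, do 10 data points satisfy the min_samples_split condition?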
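No, see the source; min_samples_split does not take sample weights into account. It is compared against the raw number of samples in a node, not their total weight. Compare min_samples_leaf and its weighted cousin min_weight_fraction_leaf (source).
Your example suggests a simple experiment to check: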
from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.array([1, 2, 3]).reshape(-1, 1)
y = [0, 0, 1]
tree = DecisionTreeClassifier()
tree.fit(X, y)
print(len(tree.tree_.feature)) # number of nodes
# 3
tree.set_params(min_samples_split=10)
tree.fit(X, y)
print(len(tree.tree_.feature))
# 1
tree.set_params(min_samples_split=10)
tree.fit(X, y, sample_weight=[20, 20, 20])
print(len(tree.tree_.feature))
# 1; the sample weights do not make each sample count
# as more than one sample toward min_samples_split
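For contrast, here is a minimal sketch of how min_weight_fraction_leaf, the weighted cousin mentioned above, does respond to sample weights. It is not part of the original answer: the two-sample toy data, the 0.4 threshold, and the variable names n_unweighted and n_weighted are illustrative choices.

```python
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Two samples, one per class, so there is exactly one candidate split.
X = np.array([1, 2]).reshape(-1, 1)
y = [0, 1]

# Each leaf must hold at least 40% of the total sample weight.
tree = DecisionTreeClassifier(min_weight_fraction_leaf=0.4)

# Equal weights: each leaf would hold 50% of the weight, so the split is allowed.
tree.fit(X, y)
n_unweighted = tree.tree_.node_count
print(n_unweighted)
# 3

# Weights 9 and 1: the second sample alone would hold only 10% of the weight,
# so the only candidate split is rejected and the root stays a leaf.
tree.fit(X, y, sample_weight=[9, 1])
n_weighted = tree.tree_.node_count
print(n_weighted)
# 1
```

So if the intent is for the stopping rule to depend on total weight rather than on the raw sample count, min_weight_fraction_leaf is the parameter that behaves that way; min_samples_split and min_samples_leaf only count samples.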