scikit learn - feature importance calculation in decision trees
I am trying to understand how the feature importance is calculated for decision trees in scikit-learn. The question has been asked before, but I am unable to reproduce the results the algorithm is providing.
For example:
from io import StringIO
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.feature_selection import mutual_info_classif

X = [[1,0,0], [0,0,0], [0,0,1], [0,1,0]]
y = [1,0,1,1]
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Unnormalized feature importances of the fitted tree
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))

out = StringIO()
out = export_graphviz(clf, out_file='test/tree.dot')
This results in the following feature importances:
feat importance = [0.25 0.08333333 0.04166667]
and gives the following decision tree:
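(The rendered tree image is not reproduced here. As a sketch, assuming scikit-learn >= 0.21, a plain-text view of the same fitted tree can be printed with export_text; the feature names below are just labels for readability.)

from sklearn.tree import export_text

# Text rendering of the fitted tree
print(export_text(clf, feature_names=['X[0]', 'X[1]', 'X[2]']))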
Now, this answer to a similar question suggests the importance is computed as
where G is the node impurity, in this case the Gini impurity. As far as I understand it, this is the impurity reduction. However, for feature 1 this should be:
This answer suggests the importance is weighted by the probability of reaching the node (which is approximated by the proportion of samples reaching that node). Again, for feature 1 this should be:
Both formulas provide the wrong result. How is the feature importance calculated correctly?
I think the feature importance depends on the implementation, so we need to look at the documentation of scikit-learn:
The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance
That reduction, or weighted information gain, is defined as:
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
Since each feature is used once in your case, the feature importance must equal the equation above.
For X[2]:
feature_importance = (4 / 4) * (0.375 - (0.75 * 0.444)) = 0.042
For X[1]:
feature_importance = (3 / 4) * (0.444 - (2/3 * 0.5)) = 0.083
For X[0]:
feature_importance = (2 / 4) * (0.5) = 0.25
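These hand calculations can be compared against what the library itself computes. A minimal sketch, re-fitting the same toy data as in the question:

from sklearn.tree import DecisionTreeClassifier

X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Unnormalized (Gini) importances as computed by scikit-learn itself.
# Note: several candidate splits are tied on this toy data, so which feature
# ends up at which node (and therefore which value it receives) can vary with
# random_state; the set of values should match the hand calculation above.
print(clf.tree_.compute_feature_importances(normalize=False))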
A single feature can be used in different branches of the tree; the feature importance is then its total contribution to reducing the impurity:
feature_importance += number_of_samples_at_parent_where_feature_is_used * impurity_at_parent - left_child_samples * impurity_left - right_child_samples * impurity_right
where impurity is the gini/entropy value, and
normalized_importance = feature_importance / number_of_samples_root_node (total number of samples)
In the example above:
feature_2_importance = 0.375*4 - 0.444*3 - 0*1 = 0.16799,
normalized = 0.16799/4 (total_num_of_samples) = 0.04199
If feature_2 was used in other branches as well, calculate its importance at each such parent node and sum up the values.
There is a difference between the feature importance calculated here and the one returned by the library, because we are using the truncated impurity values seen in the graph.
Instead, we can access all the required data through the 'tree_' attribute of the classifier, which can be used to probe the features used, threshold values, impurities, number of samples at each node, etc.
For example: clf.tree_.feature gives the list of features used at each node. A negative value indicates a leaf node.
Similarly, clf.tree_.children_left/right gives the indices into clf.tree_.feature for the left and right children.
Using the above, traverse the tree, and use the same indices in clf.tree_.impurity & clf.tree_.weighted_n_node_samples to get the gini/entropy value and the number of samples at each node and at its children.
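As a quick illustration (a sketch, continuing with the clf fitted above), these per-node arrays can be inspected directly; each array has one entry per node, in the same node order:

print(clf.tree_.feature)                 # feature tested at each node, -2 for leaves
print(clf.tree_.children_left)           # index of the left child, -1 for leaves
print(clf.tree_.children_right)          # index of the right child, -1 for leaves
print(clf.tree_.impurity)                # gini/entropy value at each node
print(clf.tree_.weighted_n_node_samples) # (weighted) number of samples at each node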
import numpy as np

def dt_feature_importance(model, normalize=True):
    left_c = model.tree_.children_left
    right_c = model.tree_.children_right
    impurity = model.tree_.impurity
    node_samples = model.tree_.weighted_n_node_samples

    # Initialize the feature importances; features not used in any split remain zero
    feature_importance = np.zeros((model.tree_.n_features,))

    for idx, node in enumerate(model.tree_.feature):
        if node >= 0:
            # Accumulate the impurity reduction over all the nodes where this feature is used
            feature_importance[node] += impurity[idx] * node_samples[idx] \
                - impurity[left_c[idx]] * node_samples[left_c[idx]] \
                - impurity[right_c[idx]] * node_samples[right_c[idx]]

    # Divide by the number of samples at the root node
    feature_importance /= node_samples[0]

    if normalize:
        normalizer = feature_importance.sum()
        if normalizer > 0:
            feature_importance /= normalizer

    return feature_importance
This function will return exactly the same values as those returned by clf.tree_.compute_feature_importances(normalize=...).
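For instance, as a quick check reusing the clf from above, both lines below should print the same array:

print(dt_feature_importance(clf, normalize=False))
print(clf.tree_.compute_feature_importances(normalize=False))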
To sort the features based on their importance:
features = clf.tree_.feature[clf.tree_.feature>=0] # Feature number should not be negative, indicates a leaf node
sorted(zip(features,dt_feature_importance(clf,False)[features]),key=lambda x:x[1],reverse=True)
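Note that with normalize=True this also matches the estimator's built-in attribute, which holds the normalized Gini importances described in the documentation quoted above (a small sketch):

print(clf.feature_importances_)
print(dt_feature_importance(clf, normalize=True))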