Display more attributes in the decision tree
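I am currently viewing the decision tree using the following code. Is there a way to export some calculated fields as part of the output as well?
For example, is it possible to show the sum of an input attribute at each node, i.e. the sum of feature 1 from the 'X' data array in the leaves of the tree?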
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:]
y = iris.target
#%%
from sklearn.tree import DecisionTreeClassifier
alg=DecisionTreeClassifier( max_depth=5,min_samples_leaf=2, max_leaf_nodes = 10)
alg.fit(X,y)
#%%
## View tree
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(alg,out_file=None, node_ids = True, proportion = True, class_names = True, filled = True, rounded = True)
graph = graphviz.Source(dot_data)
graph
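There is plenty of discussion about decision trees in scikit-learn on the github page. There are answers on this SO question and this scikit-learn documentation page that provide the framework to get you started. With all the links out of the way, here are some functions that let a user address the question in a generalizable manner. The functions can be easily modified, since I don't know whether you mean all the leaves together or each leaf individually. My approach is the latter.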
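The first function uses apply as a cheap way to find the indices of the leaf nodes. It is not necessary for what you are asking, but I included it as a convenience, since you mentioned you want to investigate leaf nodes and the leaf node indices may be unknown a priori.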
def find_leaves(X, clf):
    """A cheap function to find leaves of a DecisionTreeClassifier
    clf must be a fitted DecisionTreeClassifier
    """
    return set(clf.apply(X))
Result on the example:
find_leaves(X, alg)
{1, 7, 8, 9, 10, 11, 12}
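Since apply only reports the leaves that the rows of X actually land in, it can help to cross-check against the tree structure itself. The following is a minimal sketch, not part of the original answer: the helper name all_leaves and the random_state=0 setting are my own assumptions, and it relies on scikit-learn storing -1 as the child pointer of a leaf in tree_.children_left.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def all_leaves(clf):
    """Return the ids of every leaf of a fitted tree, read from its structure.

    Hypothetical helper: a node is treated as a leaf when its left-child
    pointer is -1, which is how scikit-learn marks nodes without children.
    """
    children_left = clf.tree_.children_left
    return set(np.flatnonzero(children_left == -1).tolist())


X, y = load_iris(return_X_y=True)
# random_state is fixed here only so repeated runs build the same tree.
clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2,
                             max_leaf_nodes=10, random_state=0)
clf.fit(X, y)

structural = all_leaves(clf)          # leaves according to the tree arrays
reached = set(clf.apply(X).tolist())  # leaves that samples in X end up in

print(sorted(structural))
print(reached <= structural)
```

On the training data both sets should coincide; on new data, reached may be a strict subset, which is the case where the structural lookup is useful.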
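The following function returns an array of values that satisfy the node and feature conditions, where node is the index of the tree node you want values for and feature is the column (or feature) of X that you want.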
def node_feature_values(X, clf, node=0, feature=0, require_leaf=False):
    """this function will return an array of values
    from the input array X. Array values will be limited to
    1. samples that passed through <node>
    2. and from the feature <feature>.
    clf must be a fitted DecisionTreeClassifier
    """
    leaf_ids = find_leaves(X, clf)
    if (require_leaf and
            node not in leaf_ids):
        print("<require_leaf> is set, "
              "select one of these nodes:\n{}".format(leaf_ids))
        return
    # a sparse array that contains node assignment by sample
    node_indicator = clf.decision_path(X)
    node_array = node_indicator.toarray()
    # which samples at least passed through the node
    samples_in_node_mask = node_array[:, node] == 1
    return X[samples_in_node_mask, feature]
Applied to the example:
values_arr = node_feature_values(X, alg, node=12, feature=0, require_leaf=True)
array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7,
5.8, 6.4, 6.5, 7.7, 7.7, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.1, 6.4,
7.4, 7.9, 6.4, 7.7, 6.3, 6.4, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7,
6.3, 6.5, 6.2, 5.9])
Now the user can perform whatever mathematical operation is desired on the subset of samples for a given feature.
i.e. sum of feature 1 from 'X' data array in the leafs of the tree.
print("There are {} total samples in this node, "
"{}% of the total".format(len(values_arr), len(values_arr) / float(len(X))*100))
print("Feature Sum: {}".format(values_arr.sum()))
There are 43 total samples in this node, 28.666666666666668% of the total
Feature Sum: 286.69999999999993
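If the sum is wanted for every node rather than one node at a time, the same decision_path matrix can be reused. This is only a sketch under my own assumptions (the function name feature_sum_per_node and random_state=0 are not from the answer above): multiplying the transposed indicator matrix by one feature column adds up that feature over the samples routed through each node.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def feature_sum_per_node(X, clf, feature=0):
    """Sum of X[:, feature] over the samples passing through each node.

    Hypothetical helper. decision_path returns a sparse indicator matrix of
    shape (n_samples, n_nodes); its transpose times the feature column gives
    one sum per node id.
    """
    node_indicator = clf.decision_path(X)
    return np.asarray(node_indicator.T.dot(X[:, feature])).ravel()


X, y = load_iris(return_X_y=True)
# random_state is fixed here only so repeated runs build the same tree.
clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2,
                             max_leaf_nodes=10, random_state=0).fit(X, y)

sums = feature_sum_per_node(X, clf, feature=0)
leaf_ids = sorted(set(clf.apply(X).tolist()))
for node_id in leaf_ids:
    print("leaf {}: feature_0 sum = {:.1f}".format(node_id, sums[node_id]))
```

The entry for node 0 is the sum over all of X, because every sample passes through the root, and the leaf entries add up to that same total.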
Update
After re-reading the question, this is the only solution I can put together quickly that doesn't involve modifying the scikit source code for export.py. The code below still relies on the previously defined functions. It modifies the dot string via pydot and networkx.
# Load the data from `dot_data` variable, which you defined.
import pydot
dot_graph = pydot.graph_from_dot_data(dot_data)[0]
import networkx as nx
MG = nx.nx_pydot.from_pydot(dot_graph)
# Select a `feature` and edit the node labels in `networkx`.
feature = 0
for n in find_leaves(X, alg):
    nfv = node_feature_values(X, alg, node=n, feature=feature)
    # `MG.nodes` is the node-attribute view (older networkx releases exposed it as `MG.node`).
    MG.nodes[str(n)]['label'] = MG.nodes[str(n)]['label'] + "\nfeature_{} sum: {}".format(feature, nfv.sum())
# Export the `networkx` graph then plot using `graphviz.Source()`
new_dot_data = nx.nx_pydot.to_pydot(MG)
# `to_string()` returns the DOT source as text, which is what `graphviz.Source` expects.
graph = graphviz.Source(new_dot_data.to_string())
graph
Notice that all the leaves now show the sum of the values from X for feature 0.
I think the best way to accomplish what you are asking would be to modify tree.py and/or export.py to natively support this feature.
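If pulling in pydot and networkx is not an option, the DOT text can also be patched with plain string handling. The sketch below is an illustration under assumptions, not the approach used above: the helper name append_to_labels is hypothetical, and the regular expression assumes each node line looks like `12 [label="...", ...] ;`, which is how I understand export_graphviz to write nodes when special_characters is left at False. The sample DOT string and the number in it are made up for the demonstration.

```python
import re

# Assumed shape of a node line in the DOT source:
#   12 [label="node #12\ngini = 0.0", fillcolor="#8139e5"] ;
# Group 1 is the node id, group 2 the label text (escaped characters allowed).
NODE_LABEL = re.compile(r'^(\d+) \[label="((?:[^"\\]|\\.)*)"')


def append_to_labels(dot_source, extra_by_node):
    """Append one extra text line to the label of selected nodes.

    Hypothetical helper: extra_by_node maps an integer node id to the text
    to add. Lines that are not node definitions are passed through as-is.
    """
    out = []
    for line in dot_source.splitlines():
        m = NODE_LABEL.match(line)
        if m and int(m.group(1)) in extra_by_node:
            extra = str(extra_by_node[int(m.group(1))]).replace('"', '\\"')
            end = m.end(2)  # index of the label's closing quote
            # "\\n" is a literal backslash-n, which Graphviz renders as a line break.
            line = line[:end] + "\\n" + extra + line[end:]
        out.append(line)
    return "\n".join(out)


# Made-up DOT fragment in the assumed format, used only for demonstration.
sample = "\n".join([
    'digraph Tree {',
    'node [shape=box, style="filled, rounded"] ;',
    '0 [label="node #0\\ngini = 0.5", fillcolor="#ffffff"] ;',
    '1 [label="node #1\\ngini = 0.0", fillcolor="#e58139"] ;',
    '0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;',
    '}',
])
patched = append_to_labels(sample, {1: "feature_0 sum: 250.3"})
print(patched)
```

With the real tree, the dictionary could be built from find_leaves and node_feature_values and the result passed to graphviz.Source; it is worth printing dot_data first to confirm the node lines match the assumed format.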