Decision Tree in sklearn: Ordinal data and still a serious issue

Question

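I would like to present an example here and ask for a solution. There are many questions on this site about decision trees and about choosing between ordinal and categorical encodings. My example is the code below: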

from sklearn import tree
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

c1=pd.Series([0,1,2,2,2,0,1,2,0,1,2])
c2=pd.Series([0,1,1,2,0,1,0,0,2,1,1])
c3=pd.Series([0,1,1,2,0,1,1,2,0,2,2])
c4=pd.Series([0,1,2,0,0,2,2,1,2,0,1])# My encoding : Veg:0, Glut:1, None:2
labels=pd.Series([0,0,0,0,1,1,1,0,0,1,1])

dnl=pd.concat([c1,c2,c3,c4],axis=1)
d=dnl.to_numpy()

clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=420,max_depth=2,splitter='best')
clf_tree = clf.fit(d, labels.to_numpy())
print(clf_tree)

score=clf_tree.score(d,labels.to_numpy())
error=1-score
print("The error= ",error)

from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(6, 6)) #figsize value changes the size of plot
plot_tree(clf_tree,ax=ax)
plt.show()


from sklearn.metrics import confusion_matrix
yp=clf_tree.predict(d)  # predict on the same NumPy array that was used for fitting
print(yp)
print(labels.to_numpy())
cm = confusion_matrix(labels, yp)
print("The confusion matrix= ",cm)

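Result:

Changing the encoding of c4 (swapping 1 and 0) as shown here changes the tree, and the misclassification error becomes smaller:

c4=pd.Series([1,0,2,1,1,2,2,0,2,1,0])# Modified encoding: Veg:1, Glut:0,None:2

Why can't the decision tree pick the middle value as a split condition?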
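The behaviour in question follows from how a tree splits a numeric feature: each node tests `feature <= threshold`. The sketch below is not sklearn's internal code; `threshold_partitions` is a hypothetical helper that simply enumerates the groupings a single threshold can produce for the original c4 encoding, assuming thresholds lie midway between neighbouring encoded values.

```python
import numpy as np

def threshold_partitions(values):
    """Return the (left, right) category groupings reachable by one `x <= t` split."""
    cats = np.unique(values)                   # sorted distinct encoded values
    thresholds = (cats[:-1] + cats[1:]) / 2    # midpoints between neighbours
    return [
        (set(cats[cats <= t].tolist()), set(cats[cats > t].tolist()))
        for t in thresholds
    ]

c4 = np.array([0, 1, 2, 0, 0, 2, 2, 1, 2, 0, 1])  # Veg:0, Glut:1, None:2
partitions = threshold_partitions(c4)
for left, right in partitions:
    print(left, "|", right)
```

Only {0} vs {1, 2} and {0, 1} vs {2} appear. Isolating the middle code 1 from {0, 2} would need two consecutive splits on the same feature, which is why relabelling the categories can change what a depth-2 tree is able to learn.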

Answer 1

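I assume the numbers 0, 1, 2 represent different categories. In that case you should one-hot encode the features before building the tree. The result will then be independent of how the categories are labelled, e.g. '2' is treated exactly the same way as '1'. In your setup, '2' is greater than '1', which is greater than '0', which implies that the categories have an order.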
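As a small illustration of that label independence, the sketch below one-hot encodes c4 under both encodings from the question. It uses `pd.get_dummies` purely for brevity (an assumption for this illustration; the variable names `c4_a`, `c4_b`, `onehot_a`, `onehot_b` are hypothetical).

```python
import numpy as np
import pandas as pd

c4_a = pd.Series([0, 1, 2, 0, 0, 2, 2, 1, 2, 0, 1])  # Veg:0, Glut:1, None:2
c4_b = c4_a.map({0: 1, 1: 0, 2: 2})                  # Veg:1, Glut:0, None:2

# One indicator column per category, ordered by the encoded value.
onehot_a = pd.get_dummies(c4_a).to_numpy(dtype=int)
onehot_b = pd.get_dummies(c4_b).to_numpy(dtype=int)

# Swapping the labels only swaps the first two indicator columns.
same_columns = np.array_equal(onehot_a[:, [1, 0, 2]], onehot_b)
print(same_columns)
```

Each category now has its own 0/1 column, so a single split of the form `column <= 0.5` can isolate any one category, including the one that used to sit in the middle of the integer encoding.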

Edit:

from sklearn import tree
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc= OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False instead

c1=pd.Series(['0','1','2','2','2','0','1','2','0','1','2'])
c2=pd.Series(['0','1','1','2','0','1','0','0','2','1','1'])
c3=pd.Series(['0','1','1','2','0','1','1','2','0','2','2'])
c4=pd.Series(['0','1','2','0','0','2','2','1','2','0','1'])# My encoding : Veg:0, Glut:1, None:2
labels=pd.Series(['0','0','0','0','1','1','1','0','0','1','1'])

dnl=pd.concat([c1,c2,c3,c4],axis=1)
dnl=dnl.to_numpy()

enc.fit(dnl)

dnl=enc.transform(dnl)
clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=420,max_depth=2,splitter='best')
clf_tree = clf.fit(dnl, labels.to_numpy()) #edited d to dnl 
print(clf_tree)

score=clf_tree.score(dnl,labels.to_numpy())
error=1-score
print("The error= ",error)

from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(6, 6)) #figsize value changes the size of plot
plot_tree(clf_tree,ax=ax)
plt.show()


from sklearn.metrics import confusion_matrix
yp=clf_tree.predict(dnl)
print(yp)
print(labels.to_numpy())
cm = confusion_matrix(labels, yp)
print("The confusion matrix= \n",cm)
