Training a decision tree with the ID3 algorithm using sklearn
I am trying to train a decision tree with the ID3 algorithm.
The goal is to get the indices of the selected features, estimate how often each occurs, and build the overall confusion matrix.
The algorithm should split the dataset into training and test sets and use 4-fold cross-validation.
I am new to this subject; I have read the sklearn tutorials and the theory behind the learning process, but I am still confused.
What I have tried:
from sklearn.model_selection import cross_val_predict, KFold, cross_val_score, \
    train_test_split, learning_curve
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train,y_train)
results = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=4)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))
y_pred = cross_val_predict(estimator=clf, X=x, y=y, cv=4)
conf_mat = confusion_matrix(y,y_pred)
print(conf_mat)
dot_data = tree.export_graphviz(clf, out_file='tree.dot')
I have a few questions:
How can I get the list of feature indices actually used in training? Do I have to walk the tree inside clf? I could not find any API method to retrieve them.
Do I have to use 'fit', 'cross_val_score' and 'cross_val_predict'? They all seem to run some kind of learning process, but I cannot get the fitted clf, the accuracy, and the confusion matrix from just one of them.
Should the evaluation be done on the held-out test set, or on the partitions produced by the dataset folds?
To retrieve the list of features used during training, you can take the columns from x:
feature_list = x.columns
As you may know, not every feature ends up being used for prediction. You can check this, after training the model, with
clf.feature_importances_
The feature indices in feature_list correspond one-to-one to the positions in the feature_importances_ array.
If you use cross-validation, you cannot retrieve the scores right away.
cross_val_score does the job, but a better way to get the scores may be cross_validate. It works like cross_val_score, except that you can retrieve several score values at once: just create each scorer you need with make_scorer and pass it in. Here is an example:
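A minimal sketch of pairing the two lists, using a toy dataset in place of your own x and y (a feature with importance 0.0 was never used in a split):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset; your own x/y DataFrame and labels go here.
data = load_iris(as_frame=True)
x, y = data.data, data.target

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(x, y)

# Pair each column name with its importance; importance 0.0 means
# the feature was never selected for a split.
importances = pd.Series(clf.feature_importances_, index=x.columns)
used_features = importances[importances > 0].index.tolist()
print(importances.sort_values(ascending=False))
print("Features actually used in splits:", used_features)
```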
import math
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
dtc = DecisionTreeClassifier()
dtc_fit = dtc.fit(x_train, y_train)
# Scorers that each extract one cell of the 2x2 confusion matrix
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
scoring = {
'tp' : make_scorer(tp),
'tn' : make_scorer(tn),
'fp' : make_scorer(fp),
'fn' : make_scorer(fn),
'accuracy' : make_scorer(accuracy_score),
'precision': make_scorer(precision_score),
'f1_score' : make_scorer(f1_score),
'recall' : make_scorer(recall_score)
}
sc = cross_validate(dtc_fit, x_train, y_train, cv=5, scoring=scoring)
print("Accuracy: %0.2f (+/- %0.2f)" % (sc['test_accuracy'].mean(), sc['test_accuracy'].std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (sc['test_precision'].mean(), sc['test_precision'].std() * 2))
print("f1_score: %0.2f (+/- %0.2f)" % (sc['test_f1_score'].mean(), sc['test_f1_score'].std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (sc['test_recall'].mean(), sc['test_recall'].std() * 2), "\n")
stp = math.ceil(sc['test_tp'].mean())
stn = math.ceil(sc['test_tn'].mean())
sfp = math.ceil(sc['test_fp'].mean())
sfn = math.ceil(sc['test_fn'].mean())
conf_m = pd.DataFrame(
[[stn, sfp], [sfn, stp]],
columns=['Predicted 0', 'Predicted 1'],
index=['True 0', 'True 1']
)
print(conf_m)
When you use a cross_val function, the function itself creates the folds for training and testing. If you want to manage the train folds and test folds yourself, you can use the KFold class to do it on your own.
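A minimal sketch of managing the folds yourself with KFold (a toy dataset stands in for the X and y used in the snippets above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kf.split(X):
    # Each iteration yields one train/test partition of the data.
    clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
    clf.fit(X[train_index], y[train_index])
    scores.append(accuracy_score(y[test_index], clf.predict(X[test_index])))

print("Accuracy per fold:", np.round(scores, 2))
print("Mean accuracy: %0.2f" % np.mean(scores))
```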
If you need to preserve the class balance, which a DecisionTreeClassifier always needs to score well, you must use StratifiedKFold. If you also want to randomly shuffle the values that go into each fold, you can use StratifiedShuffleSplit. An example:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, make_scorer, recall_score
import pandas as pd, numpy as np
precision = []; recall = []; f1score = []; accuracy = []
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
dtc = DecisionTreeClassifier()
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    dtc.fit(X_train, y_train)
    pred = dtc.predict(X_test)
    precision.append(precision_score(y_test, pred))
    recall.append(recall_score(y_test, pred))
    f1score.append(f1_score(y_test, pred))
    accuracy.append(accuracy_score(y_test, pred))
print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracy),np.std(accuracy) * 2))
print("Precision: %0.2f (+/- %0.2f)" % (np.mean(precision),np.std(precision) * 2))
print("f1_score: %0.2f (+/- %0.2f)" % (np.mean(f1score),np.std(f1score) * 2))
print("Recall: %0.2f (+/- %0.2f)" % (np.mean(recall),np.std(recall) * 2))
I hope this answers everything you needed!