100% accuracy with decision tree classifier using sklearn

Question

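I am using sklearn's decision tree classifier and I get a score of 100%, but I can't tell what is wrong. I have also tested SVM and KNN, and both give accuracies between 60% and 80%, which seems reasonable. Here is my code: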

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # X_train, Y_train, X_valid, Y_valid are assumed to be defined earlier (not shown)
    maxScore = 0
    index = 0
    Depths = [1, 5, 10, 20, 40]
    for i, d in enumerate(Depths):
        clf1 = DecisionTreeClassifier(max_depth=d)
        # Mean accuracy over 10 cross-validation folds
        score = cross_val_score(clf1, X_train, Y_train, cv=10).mean()
        # Remember the index of the best-scoring depth seen so far
        index = i if score > maxScore else index
        maxScore = max(score, maxScore)
        print('The cross val score for Decision Tree classifier (max_depth=' + str(d) + ') is ' +
              str(score))

    d = Depths[index]
    print()
    print("So the best value for max_depth parameter is " + str(d))
    print()

    # Classifying
    clf1 = DecisionTreeClassifier(max_depth=d)
    clf1.fit(X_train, Y_train)
    preds = clf1.predict(X_valid)
    print(" The accuracy obtained using Decision tree classifier is {0:.8f}%".format(
        100 * clf1.score(X_valid, Y_valid)))

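This is the output:

    The cross val score for Decision Tree classifier (max_depth=1) is 1.0
    The cross val score for Decision Tree classifier (max_depth=5) is 0.9996212121212121
    The cross val score for Decision Tree classifier (max_depth=10) is 1.0
    The cross val score for Decision Tree classifier (max_depth=20) is 1.0
    The cross val score for Decision Tree classifier (max_depth=40) is 0.9996212121212121

    So the best value for max_depth parameter is 1

     The accuracy obtained using Decision tree classifier is 100.00000000%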

Answer 1

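I think there is one obvious conclusion: your labels are highly correlated with some of the features, or at least with one of them. Note that the tree with max_depth=1 already scores 1.0, which means a single split on one feature is enough to separate the classes. Possibly your data is not very good.

In any case, you can check how the individual feature splits of the decision tree affect the model's predictions.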
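One way to run that check is to score a one-split tree on each feature in isolation. The sketch below is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins for your `X_train` and `Y_train`, and column 2 is deliberately built as a "leaky" copy of the label so the effect is visible.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for X_train / Y_train (assumption): 300 samples, 4 features.
X_demo = rng.normal(size=(300, 4))
y_demo = (rng.random(300) > 0.5).astype(int)
# Column 2 is made "leaky" on purpose: it is the label plus a little noise.
X_demo[:, 2] = y_demo + rng.normal(scale=0.01, size=300)

# Score a depth-1 tree (a stump) on each feature taken alone.
stump_scores = []
for j in range(X_demo.shape[1]):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    score = cross_val_score(stump, X_demo[:, [j]], y_demo, cv=10).mean()
    stump_scores.append(score)
    print(f"feature {j}: stump CV accuracy = {score:.3f}")

# The feature whose stump scores highest is the first one to inspect.
suspect = int(np.argmax(stump_scores))
print("most suspicious feature:", suspect)
```

If one column of your real data reaches near-perfect accuracy on its own, it is worth checking whether that column was derived from the label or is otherwise unavailable at prediction time.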

Use the model.feature_importances_ attribute to see how much each feature contributes to the model's predictions.
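A minimal sketch of reading that attribute is shown here. The data and the names in `feature_names` are made up for illustration, with one column constructed to leak the label; with your own data you would fit on `X_train` and `Y_train` instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data (assumption): column 2 is the label plus a little noise.
X_demo = rng.normal(size=(300, 4))
y_demo = (rng.random(300) > 0.5).astype(int)
X_demo[:, 2] = y_demo + rng.normal(scale=0.01, size=300)
feature_names = ["f0", "f1", "f2_leaky", "f3"]  # hypothetical names

model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_demo, y_demo)

# feature_importances_ holds one value per column; the values sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

An importance close to 1 for a single feature, combined with a perfect score, suggests the tree is relying almost entirely on that one column.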

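See the documentation for Decision Tree Classifier.

If you still think your model's predictions are not good enough, I suggest changing your model and using one based on a different approach. At least, if you have to use decision trees, you can try the Random Forest Classifier.

It is an ensemble model. The basic idea of ensemble learning is that the final prediction is based on the predictions of multiple weaker models (weak learners). Check the main approaches to building ensemble models.

In a random forest classifier, the weak learners are decision trees, which may be kept shallow. Each tree uses only a small subset of the features for prediction, and the features are selected randomly each time. The number of selected features is a hyperparameter, so it needs to be tuned.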
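As a rough sketch of that tuning step, the example below searches over the number of features with `GridSearchCV`. The dataset comes from `make_classification` and the grid values are arbitrary choices for illustration, not recommendations. In scikit-learn's `RandomForestClassifier`, the relevant parameter is `max_features`, which sets how many features are considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption): 300 samples, 10 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Illustrative grid: number of features tried per split, and tree depth.
param_grid = {"max_features": [2, 4, 6, 10], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

Note that a random forest will not fix a leaky feature: if one column reproduces the label, the forest is likely to score near 100% as well, so the data check should come first.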

Check the links and other tutorials for more information.

python

machine-learning

decision-tree

scikit-learn