Using the Same Dataset on 3 Different Classifiers is Outputting The Same Confusion Matrices/Accuracy Scores

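I'm facing a problem where 3 different classifiers, all trained on the same dataset (the sklearn iris dataset), output exactly the same accuracy scores and confusion matrices. I emailed my professor to ask whether this is normal and, if not, whether she had any suggestions. Her answer was basically "This is not normal, go back and look at your code."

I have gone through my code carefully since then, but I can't see what is going on. I'm hoping someone here can shed some light on this so that I can learn something from the experience.

Here is my code: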

# Dataset
from sklearn import datasets

# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Classifiers
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Performance Metrics
from sklearn.metrics import confusion_matrix, accuracy_score

if __name__ == '__main__':
    # Read dataset into memory.
    iris = datasets.load_iris()

    # Extract independent and dependent variables into variables.
    X = iris.data
    y = iris.target

    # Split training and test sets (70/30).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

    # Fit the scaler to the training set only, then transform the feature columns of both the
    # training and test sets. All features are scaled since none of them are categorical.
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)

    # Create the classifiers.
    dt_classifier = DecisionTreeClassifier(random_state=0)
    svm_classifier = SVC(kernel='rbf', random_state=0)
    lr_classifier = LogisticRegression(random_state=0)

    # Fit the classifiers to the training data.
    dt_classifier.fit(X_train, y_train)
    svm_classifier.fit(X_train, y_train)
    lr_classifier.fit(X_train, y_train)

    # Predict using the now trained classifiers.
    dt_y_pred = dt_classifier.predict(X_test)
    svm_y_pred = svm_classifier.predict(X_test)
    lr_y_pred = lr_classifier.predict(X_test)

    # Create confusion matrices using the predicted results and the actual results from the test set.
    dt_cm = confusion_matrix(y_test, dt_y_pred)
    svm_cm = confusion_matrix(y_test, svm_y_pred)
    lr_cm = confusion_matrix(y_test, lr_y_pred)

    # Calculate accuracy scores using the predicted results and the actual results from the test set.
    dt_score = accuracy_score(y_test, dt_y_pred)
    svm_score = accuracy_score(y_test, svm_y_pred)
    lr_score = accuracy_score(y_test, lr_y_pred)

    # Print confusion matrices and accuracy scores for each classifier.

    print('--- Decision Tree Classifier ---')
    print(f'Confusion Matrix:\n{dt_cm}')
    print(f'Accuracy Score:{dt_score}\n')

    print('--- Support Vector Machine Classifier ---')
    print(f'Confusion Matrix:\n{svm_cm}')
    print(f'Accuracy Score:{svm_score}\n')

    print('--- Logistic Regression Classifier ---')
    print(f'Confusion Matrix:\n{lr_cm}')
    print(f'Accuracy Score:{lr_score}')

Output:

--- Decision Tree Classifier ---
Confusion Matrix:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Accuracy Score:0.9777777777777777

--- Support Vector Machine Classifier ---
Confusion Matrix:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Accuracy Score:0.9777777777777777

--- Logistic Regression Classifier ---
Confusion Matrix:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Accuracy Score:0.9777777777777777

As you can see, the output for each classifier is exactly the same. I would appreciate any help anyone can give me.

There is nothing wrong with your code.

Such similar results are not unexpected when:

  1. The data is relatively "easy"
  2. The sample is too small

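Both premises hold here. The iris data is easy to classify with modern ML algorithms (including the ones you use here); this, combined with your very small test set (only 45 samples), makes such results unsurprising.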
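One way to check that the three models are not secretly identical is to score them over many different splits instead of a single one. The following is a minimal sketch, not part of the original code: the helper name `compare_with_cv` and the choice of 5 folds repeated 10 times are assumptions made for illustration. Scaling is placed inside a pipeline so that the scaler is refit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_with_cv(n_splits=5, n_repeats=10, seed=0):
    """Return {model name: (mean accuracy, std)} over repeated stratified folds.

    Hypothetical helper; the fold and repeat counts are arbitrary choices.
    """
    X, y = load_iris(return_X_y=True)
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "svm_rbf": SVC(kernel="rbf", random_state=seed),
        "logistic_regression": LogisticRegression(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        # The scaler lives in the pipeline, so it only ever sees training folds.
        pipeline = make_pipeline(StandardScaler(), model)
        scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


cv_results = compare_with_cv()
for name, (mean, std) in cv_results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

With 50 evaluations per model instead of one, small differences between the classifiers have a chance to show up, whereas a single 45-sample test set can only move accuracy in steps of 1/45.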

In fact, simply changing the data split to use test_size=0.20 gives a perfect accuracy of 1.0 from all 3 models.
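To try this yourself, the sketch below reruns the same three classifiers for more than one test_size value and reports the accuracy next to the number of test samples. The helper name `accuracy_by_test_size` is hypothetical, and the exact scores you see may depend on your scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def accuracy_by_test_size(test_sizes=(0.20, 0.30), seed=0):
    """Return {(test_size, model name): (accuracy, number of test samples)}.

    Hypothetical helper that mirrors the preprocessing used in the question.
    """
    X, y = load_iris(return_X_y=True)
    results = {}
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # Fit the scaler on the training split only, then apply it to both.
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        models = {
            "decision_tree": DecisionTreeClassifier(random_state=seed),
            "svm_rbf": SVC(kernel="rbf", random_state=seed),
            "logistic_regression": LogisticRegression(random_state=seed),
        }
        for name, model in models.items():
            y_pred = model.fit(X_train, y_train).predict(X_test)
            results[(test_size, name)] = (
                accuracy_score(y_test, y_pred),
                len(y_test),
            )
    return results


split_results = accuracy_by_test_size()
for (test_size, name), (accuracy, n_test) in split_results.items():
    print(f"test_size={test_size:.2f} {name}: {accuracy:.4f} (n_test={n_test})")
```

Printing the test-set size alongside each score makes it easier to see how coarse the accuracy scale is: one misclassified sample out of 45 already accounts for the 0.9778 reported in the question.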

Nothing to worry about.