Using the Same Dataset on 3 Different Classifiers is Outputting The Same Confusion Matrices/Accuracy Scores
I'm running into an issue where 3 different classifiers, all trained on the same dataset (the sklearn iris dataset), output the exact same accuracy scores and confusion matrices. I emailed my professor asking whether this was normal and, if not, whether she had any advice, and what I got back was essentially "it's not normal, go back and look at your code."
I've gone over my code carefully since then, but I can't see what's going on. I'm hoping someone here can shed some light on this for me and that I'll be able to learn something from the experience.
Here is my code:
# Dataset
from sklearn import datasets
# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Classifiers
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Performance Metrics
from sklearn.metrics import confusion_matrix, accuracy_score
if __name__ == '__main__':
    # Read dataset into memory.
    iris = datasets.load_iris()
    # Extract independent and dependent variables into variables.
    X = iris.data
    y = iris.target
    # Split training and test sets (70/30).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    # Fit the scaler to the training set, then transform the feature columns of both the
    # training and test sets (all of them, since none of the features are categorical).
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    # Create the classifiers.
    dt_classifier = DecisionTreeClassifier(random_state=0)
    svm_classifier = SVC(kernel='rbf', random_state=0)
    lr_classifier = LogisticRegression(random_state=0)
    # Fit the classifiers to the training data.
    dt_classifier.fit(X_train, y_train)
    svm_classifier.fit(X_train, y_train)
    lr_classifier.fit(X_train, y_train)
    # Predict using the now trained classifiers.
    dt_y_pred = dt_classifier.predict(X_test)
    svm_y_pred = svm_classifier.predict(X_test)
    lr_y_pred = lr_classifier.predict(X_test)
    # Create confusion matrices using the predicted results and the actual results from the test set.
    dt_cm = confusion_matrix(y_test, dt_y_pred)
    svm_cm = confusion_matrix(y_test, svm_y_pred)
    lr_cm = confusion_matrix(y_test, lr_y_pred)
    # Calculate accuracy scores using the predicted results and the actual results from the test set.
    dt_score = accuracy_score(y_test, dt_y_pred)
    svm_score = accuracy_score(y_test, svm_y_pred)
    lr_score = accuracy_score(y_test, lr_y_pred)
    # Print confusion matrices and accuracy scores for each classifier.
    print('--- Decision Tree Classifier ---')
    print(f'Confusion Matrix:\n{dt_cm}')
    print(f'Accuracy Score:{dt_score}\n')
    print('--- Support Vector Machine Classifier ---')
    print(f'Confusion Matrix:\n{svm_cm}')
    print(f'Accuracy Score:{svm_score}\n')
    print('--- Logistic Regression Classifier ---')
    print(f'Confusion Matrix:\n{lr_cm}')
    print(f'Accuracy Score:{lr_score}')
Output:
--- Decision Tree Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Support Vector Machine Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Logistic Regression Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
As you can see, the output is exactly the same for every one of the different classifiers. Any kind of help would be greatly appreciated.
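For what it's worth, here is a quick sanity check (not part of the script above, just a sketch) that could be run after the predictions to confirm the three classifiers agree on every individual test sample, and not only in the aggregated counts:

import numpy as np

# Sanity check (sketch): do the three classifiers agree on every single test sample,
# not just in their aggregated confusion matrices?
print(np.array_equal(dt_y_pred, svm_y_pred))
print(np.array_equal(dt_y_pred, lr_y_pred))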
There is nothing wrong with your code.
Such similarity in the results is not unexpected when:
- the data are relatively "easy"
- the sample is too small
Both premises hold here. The iris data are famously easy to classify with modern ML algorithms, including the ones you use here; this, combined with your ridiculously small test set (only 45 samples), makes such results unsurprising.
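If you want to convince yourself that the three models are not literally interchangeable, one rough sketch (my addition, reusing the X, y and classifier classes from your script) is to score them with cross-validation on the full data, where small per-fold differences, if any, will show up:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Sketch: 10-fold cross-validated accuracy per classifier, with scaling done inside
# the pipeline so each fold is scaled using only its own training portion.
for name, clf in [('Decision Tree', DecisionTreeClassifier(random_state=0)),
                  ('SVM (RBF)', SVC(kernel='rbf', random_state=0)),
                  ('Logistic Regression', LogisticRegression(random_state=0))]:
    scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=10)
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')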
In fact, simply changing the data split to use test_size=0.20, you will get a perfect accuracy of 1.0 from all 3 models.
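Concretely, that is a one-line change to the split in your script; everything else stays the same:

# 80/20 split instead of 70/30; the rest of the script is unchanged.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)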
Nothing to worry about.