Using the Same Dataset on 3 Different Classifiers is Outputting The Same Confusion Matrices/Accuracy Scores
I'm running into an issue where 3 different classifiers, all trained on the same dataset (the sklearn iris dataset), output the exact same accuracy scores and confusion matrices. I emailed my professor asking whether this was normal and, if not, whether she had any advice, and what I got back was essentially "it's not normal, go back and look at your code."
I've gone over my code carefully since then, but I can't see what's going on. I'm hoping someone here can shed some light on this for me and that I'll be able to learn something from the experience.
Here is my code:
# Dataset
from sklearn import datasets
# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Classifiers
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# Performance Metrics
from sklearn.metrics import confusion_matrix, accuracy_score
if __name__ == '__main__':
    # Read dataset into memory.
    iris = datasets.load_iris()
    # Extract independent and dependent variables into variables.
    X = iris.data
    y = iris.target
    # Split training and test sets (70/30).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    # Fit the scaler to the training set, then transform the feature columns of both the
    # training and test sets (all of them, since none of the features are categorical).
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)
    # Create the classifiers.
    dt_classifier = DecisionTreeClassifier(random_state=0)
    svm_classifier = SVC(kernel='rbf', random_state=0)
    lr_classifier = LogisticRegression(random_state=0)
    # Fit the classifiers to the training data.
    dt_classifier.fit(X_train, y_train)
    svm_classifier.fit(X_train, y_train)
    lr_classifier.fit(X_train, y_train)
    # Predict using the now trained classifiers.
    dt_y_pred = dt_classifier.predict(X_test)
    svm_y_pred = svm_classifier.predict(X_test)
    lr_y_pred = lr_classifier.predict(X_test)
    # Create confusion matrices using the predicted results and the actual results from the test set.
    dt_cm = confusion_matrix(y_test, dt_y_pred)
    svm_cm = confusion_matrix(y_test, svm_y_pred)
    lr_cm = confusion_matrix(y_test, lr_y_pred)
    # Calculate accuracy scores using the predicted results and the actual results from the test set.
    dt_score = accuracy_score(y_test, dt_y_pred)
    svm_score = accuracy_score(y_test, svm_y_pred)
    lr_score = accuracy_score(y_test, lr_y_pred)
    # Print confusion matrices and accuracy scores for each classifier.
    print('--- Decision Tree Classifier ---')
    print(f'Confusion Matrix:\n{dt_cm}')
    print(f'Accuracy Score:{dt_score}\n')
    print('--- Support Vector Machine Classifier ---')
    print(f'Confusion Matrix:\n{svm_cm}')
    print(f'Accuracy Score:{svm_score}\n')
    print('--- Logistic Regression Classifier ---')
    print(f'Confusion Matrix:\n{lr_cm}')
    print(f'Accuracy Score:{lr_score}')
Output:
--- Decision Tree Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Support Vector Machine Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
--- Logistic Regression Classifier ---
Confusion Matrix:
[[16 0 0]
[ 0 17 1]
[ 0 0 11]]
Accuracy Score:0.9777777777777777
As you can see, the output is exactly the same for every one of the different classifiers. Any kind of help would be greatly appreciated.
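For what it's worth, here is a quick sanity check (not part of the script above, just a sketch) that could be run after the predictions to confirm the three classifiers agree on every individual test sample, and not only in the aggregated counts:

import numpy as np

# Sanity check (sketch): do the three classifiers agree on every single test sample,
# not just in their aggregated confusion matrices?
print(np.array_equal(dt_y_pred, svm_y_pred))
print(np.array_equal(dt_y_pred, lr_y_pred))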
There is nothing wrong with your code.
Such similarity in the results is not unexpected when:
- the data are relatively "easy"
- the sample is too small
Both premises hold here. The iris data are famously easy to classify with modern ML algorithms, including the ones you use here; this, combined with your ridiculously small test set (only 45 samples), makes such results unsurprising.
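If you want to convince yourself that the three models are not literally interchangeable, one rough sketch (my addition, reusing the X, y and classifier classes from your script) is to score them with cross-validation on the full data, where small per-fold differences, if any, will show up:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Sketch: 10-fold cross-validated accuracy per classifier, with scaling done inside
# the pipeline so each fold is scaled using only its own training portion.
for name, clf in [('Decision Tree', DecisionTreeClassifier(random_state=0)),
                  ('SVM (RBF)', SVC(kernel='rbf', random_state=0)),
                  ('Logistic Regression', LogisticRegression(random_state=0))]:
    scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=10)
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')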
In fact, simply changing the data split to use test_size=0.20, you will get a perfect accuracy of 1.0 from all 3 models.
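Concretely, that is a one-line change to the split in your script; everything else stays the same:

# 80/20 split instead of 70/30; the rest of the script is unchanged.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)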
Nothing to worry about.