使用 MNIST 数据集的 Sklearn SVC:数字 5 始终错误?
Sklearn SVC with MNIST Dataset: Consistently wrong with the digit 5?
我已经建立了一个非常简单的 SVC 来对 MNIST 数字进行分类。出于某种原因,分类器非常一致地 错误地 预测数字 5,但在尝试所有其他数字时它不会错过任何一个。有谁知道我是否可能设置错误,或者它是否真的无法预测数字 5?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
data = datasets.load_digits()
images = data.images
targets = data.target
# Split into train and test sets
images_train, images_test, imlabels_train, imlabels_test = train_test_split(images, targets, test_size=.2, shuffle=False)
# Re-shape data so that it's 2D
images_train = np.reshape(images_train, (np.shape(images_train)[0], 64))
images_test = np.reshape(images_test, (np.shape(images_test)[0], 64))
svm_classifier = SVC(gamma='auto').fit(images_train, imlabels_train)
number_correct_svc = 0
preds = []
for label_index in range(len(imlabels_test)):
pred = svm_classifier.predict(images_test[label_index].reshape(1,-1))
if pred[0] == imlabels_test[label_index]:
number_correct_svc += 1
preds.append(pred[0])
print("Support Vector Classifier...")
print(f"\tPercent correct for all test data: {100*number_correct_svc/len(imlabels_test)}%")
confusion_matrix(preds,imlabels_test)
这是生成的混淆矩阵:
array([[22, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 15, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 15, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 21, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 21, 0, 0, 0, 0, 0],
[13, 21, 20, 16, 16, 37, 23, 20, 31, 16],
[ 0, 0, 0, 0, 0, 0, 14, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 16, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 21]], dtype=int64)
我一直在阅读 SVC 的 sklearn 页面,但不知道我做错了什么
更新:
我试过使用 SCV(gamma='scale') 似乎更合理。知道为什么 'auto' 不起作用仍然很高兴?
规模:
array([[34, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 36, 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, 35, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 27, 0, 0, 0, 0, 0, 1],
[ 1, 0, 0, 0, 34, 0, 0, 0, 0, 0],
[ 0, 0, 0, 2, 0, 37, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 37, 0, 0, 0],
[ 0, 0, 0, 2, 0, 0, 0, 35, 0, 1],
[ 0, 0, 0, 6, 1, 0, 0, 1, 31, 1],
[ 0, 0, 0, 0, 2, 0, 0, 0, 1, 33]], dtype=int64)
第二个问题比较好处理。问题是在 RBF 内核中,gamma 表示决策边界的摆动程度。 "wiggly" 是什么意思? gamma 的值越高,决策边界就越精确。 SVM 的决策边界。
if gamma='scale'
(default) is passed then it uses 1 / (n_features *X.var())
as value of gamma,
if ‘auto’, uses 1 / n_features
.
在第二种情况下,gamma 更高。对于 MNIST,标准偏差小于 1。因此,第二个决策边界比前一个案例更精确,给出了更好的结果。
我已经建立了一个非常简单的 SVC 来对 MNIST 数字进行分类。出于某种原因,分类器非常一致地 错误地 预测数字 5,但在尝试所有其他数字时它不会错过任何一个。有谁知道我是否可能设置错误,或者它是否真的无法预测数字 5?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
data = datasets.load_digits()
images = data.images
targets = data.target
# Split into train and test sets
images_train, images_test, imlabels_train, imlabels_test = train_test_split(images, targets, test_size=.2, shuffle=False)
# Re-shape data so that it's 2D
images_train = np.reshape(images_train, (np.shape(images_train)[0], 64))
images_test = np.reshape(images_test, (np.shape(images_test)[0], 64))
svm_classifier = SVC(gamma='auto').fit(images_train, imlabels_train)
number_correct_svc = 0
preds = []
for label_index in range(len(imlabels_test)):
pred = svm_classifier.predict(images_test[label_index].reshape(1,-1))
if pred[0] == imlabels_test[label_index]:
number_correct_svc += 1
preds.append(pred[0])
print("Support Vector Classifier...")
print(f"\tPercent correct for all test data: {100*number_correct_svc/len(imlabels_test)}%")
confusion_matrix(preds,imlabels_test)
这是生成的混淆矩阵:
array([[22, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 15, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 15, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 21, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 21, 0, 0, 0, 0, 0],
[13, 21, 20, 16, 16, 37, 23, 20, 31, 16],
[ 0, 0, 0, 0, 0, 0, 14, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 16, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 21]], dtype=int64)
我一直在阅读 SVC 的 sklearn 页面,但不知道我做错了什么
更新:
我试过使用 SCV(gamma='scale') 似乎更合理。知道为什么 'auto' 不起作用仍然很高兴? 规模:
array([[34, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 36, 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, 35, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 27, 0, 0, 0, 0, 0, 1],
[ 1, 0, 0, 0, 34, 0, 0, 0, 0, 0],
[ 0, 0, 0, 2, 0, 37, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 37, 0, 0, 0],
[ 0, 0, 0, 2, 0, 0, 0, 35, 0, 1],
[ 0, 0, 0, 6, 1, 0, 0, 1, 31, 1],
[ 0, 0, 0, 0, 2, 0, 0, 0, 1, 33]], dtype=int64)
第二个问题比较好处理。问题是在 RBF 内核中,gamma 表示决策边界的摆动程度。 "wiggly" 是什么意思? gamma 的值越高,决策边界就越精确。 SVM 的决策边界。
if
gamma='scale'
(default) is passed then it uses1 / (n_features *X.var())
as value of gamma,if ‘auto’, uses
1 / n_features
.
在第二种情况下,gamma 更高。对于 MNIST,标准偏差小于 1。因此,第二个决策边界比前一个案例更精确,给出了更好的结果。