Why is the confusion_matrix different when I run the code again?
I would like to know why the confusion_matrix changes when I run the code a second time, and whether this can be avoided. To be precise, the first run gave [[53445 597] [958 5000]], but the second run gave [[52556 1486] [805 5153]].
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# get the data from the dataset and split it into a training set and a test set
# (as_frame=False returns NumPy arrays, so the positional indexing below works)
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist['data'], mnist['target']
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# shuffle the training data
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
# True where the label is '2', False for all other digits
y_train_2 = (y_train == '2')
y_test_2 = (y_test == '2')
# train a binary classifier on the True/False labels
# I use random_state=0, so it will not change, am I right?
sgd_clf = SGDClassifier(random_state=0)
sgd_clf.fit(X_train, y_train_2)
# get the confusion_matrix from 3-fold cross-validated predictions
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3)
print('confusion_matrix is', confusion_matrix(y_train_2, y_train_pred))
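You are using differently ordered data on each run (shuffle_index), so there is no reason for the training run and the resulting confusion matrix to be exactly the same, although the results should be close if the algorithm does a good job. Note that random_state=0 only fixes the randomness inside SGDClassifier; it has no effect on np.random.permutation.
To remove the randomness, either specify the indices: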
shuffle_index = np.arange(60000) #Rather "not_shuffled_index"
or use the same seed:
np.random.seed(1) #Or any number
shuffle_index = np.random.permutation(60000) #Will be the same for a given seed
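As a quick check of the seeding approach, here is a minimal sketch that does not touch MNIST at all. The helper name shuffled_indices and the size of 10 are made up for illustration, and it uses a local np.random.RandomState instead of the global np.random.seed, which is just one way to get the same effect without changing global state.

```python
import numpy as np

def shuffled_indices(n, seed):
    """Return a permutation of range(n) that depends only on `seed`.

    Hypothetical helper for illustration; a local RandomState keeps the
    global NumPy random state untouched.
    """
    rng = np.random.RandomState(seed)
    return rng.permutation(n)

# Two calls with the same seed give the same ordering.
first = shuffled_indices(10, seed=1)
second = shuffled_indices(10, seed=1)
print(np.array_equal(first, second))  # True

# The result is still a real permutation: every index appears exactly once.
print(sorted(first.tolist()) == list(range(10)))  # True
```

With the shuffle fixed this way and random_state=0 kept on the classifier, repeated runs see the same rows in the same order, so the confusion matrix should stop changing between runs.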