Calculate confusion_matrix for Training set
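I am new to machine learning. Recently I learned how to calculate the confusion_matrix for the Test set of a KNN Classification, but I don't know how to calculate the confusion_matrix for the Training set of a KNN Classification.
How can I compute the confusion_matrix for the Training set of the KNN Classification from the following code?
The following code computes the confusion_matrix for the Test set: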
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.iloc[:, 1:10])  # .ix was removed from pandas; .iloc selects columns 1-9 by position
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#Define Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
# Predicting the Test set results
y_pred = knn.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Calculate the confusion matrix for the test set.
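For k-fold cross-validation:
I am also trying to find the confusion_matrix for the Training set using k-fold cross-validation.
I am confused by the line knn.fit(X_train, y_train).
Do I need to change the line knn.fit(X_train, y_train)?
Where should I change the following code to compute the confusion_matrix for the training set?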
# Applying k-fold Method
from sklearn.model_selection import StratifiedKFold
kfold = 10 # no. of folds (better to have this at the start of the code)
# random_state only takes effect when shuffle=True
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
# Stratified KFold: this divides the data into k folds while making sure that the class distribution in each fold follows the original input distribution
# Note: sklearn.cross_validation was removed in scikit-learn 0.20; StratifiedKFold now lives in sklearn.model_selection
skfind = list(skf.split(X, y)) # indices
# skfind[i][0] -> train indices, skfind[i][1] -> test indices
# Supervised Classification with k-fold Cross Validation
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
conf_mat = np.zeros((2,2)) # Initializing the Confusion Matrix
n_neighbors = 1 # better to have this at the start of the code (unused below: the classifier is built with n_neighbors = 5)
# 10-fold Cross Validation
for i in range(kfold):
    train_indices = skfind[i][0]
    test_indices = skfind[i][1]
    clf = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    # fit Training set
    clf.fit(X_train,y_train)
    # predict Test data
    y_predict_test = clf.predict(X_test) # output is labels and not indices
    # Compute confusion matrix
    cm = confusion_matrix(y_test,y_predict_test)
    print(cm)
    # conf_mat = conf_mat + cm
You don't need to change much:
# Predicting the train set results
y_train_pred = knn.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)
Here we pass X_train instead of X_test to predict, and then build the confusion matrix from the predicted classes and the actual classes of the training set.
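Since `dataset` from the question is not available here, the following is only a minimal sketch of the same idea on stand-in data. It assumes scikit-learn's built-in breast cancer dataset as a substitute (loaded with `load_breast_cancer`), and the names `cm_test` and `cm_train` are chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): replaces the question's `dataset`.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit once on the training split, exactly as in the question.
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)

# Same fitted model, two different inputs to predict().
cm_test = confusion_matrix(y_test, knn.predict(X_test))
cm_train = confusion_matrix(y_train, knn.predict(X_train))

print("test:\n", cm_test)
print("train:\n", cm_train)
```

The call to `knn.fit` does not change. Only the data passed to `predict` and the matching true labels passed to `confusion_matrix` differ between the two matrices.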
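The idea behind a confusion matrix is essentially to count how many predictions fall into each of four categories (if y is binary):
- predicted true but actually false (false positive)
- predicted true and actually true (true positive)
- predicted false but actually true (false negative)
- predicted false and actually false (true negative)
So as long as you have two sets of labels, the predicted ones and the actual ones, you can build a confusion matrix. All you have to do is predict the classes and compare them with the actual classes.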
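To make the four categories concrete, here is a small sketch that counts them by hand with NumPy. The label arrays are made up for illustration, and the final layout (rows are actual classes, columns are predicted classes) is the convention scikit-learn's `confusion_matrix` uses:

```python
import numpy as np

# Hypothetical labels for illustration: 1 = positive, 0 = negative.
y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_predicted = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Count each of the four categories with boolean masks.
tp = int(np.sum((y_predicted == 1) & (y_actual == 1)))  # predicted true, actually true
fp = int(np.sum((y_predicted == 1) & (y_actual == 0)))  # predicted true, actually false
fn = int(np.sum((y_predicted == 0) & (y_actual == 1)))  # predicted false, actually true
tn = int(np.sum((y_predicted == 0) & (y_actual == 0)))  # predicted false, actually false

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = np.array([[tn, fp],
               [fn, tp]])
print(cm)
# [[3 1]
#  [1 3]]
```

Nothing in this counting depends on whether the labels came from the training set or the test set, which is why the same function works for both.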
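Edit
In the cross-validation part, you can add the line y_predict_train = clf.predict(X_train) to compute the training-set confusion matrix on each iteration. You can do this because you create a new clf on every pass through the loop, which effectively resets your model.
Also, in your code you compute the confusion matrix on every iteration but never store it anywhere. At the end you are left with only the cm of the last test fold.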
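One possible way to put both points together is sketched below: predict on the training fold inside the loop, and add each fold's matrix to a running total instead of overwriting it. This is an illustration under assumptions, not the only way to do it. It uses scikit-learn's built-in breast cancer data as a stand-in for `dataset`, and the accumulator names `conf_mat_train` and `conf_mat_test` are made up here:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): replaces the question's `dataset`.
X, y = load_breast_cancer(return_X_y=True)
labels = np.unique(y)

kfold = 10
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)

# Running totals, one for the training folds and one for the test folds.
conf_mat_train = np.zeros((2, 2), dtype=np.int64)
conf_mat_test = np.zeros((2, 2), dtype=np.int64)

for train_indices, test_indices in skf.split(X, y):
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]

    # A fresh classifier per fold, so nothing carries over between folds.
    clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
    clf.fit(X_train, y_train)

    y_predict_train = clf.predict(X_train)
    y_predict_test = clf.predict(X_test)

    # labels= keeps the matrix 2x2 even if a fold were missing a class.
    conf_mat_train += confusion_matrix(y_train, y_predict_train, labels=labels)
    conf_mat_test += confusion_matrix(y_test, y_predict_test, labels=labels)

print("train (summed over folds):\n", conf_mat_train)
print("test (summed over folds):\n", conf_mat_test)
```

Note that the two totals are on different scales: every sample lands in exactly one test fold but in kfold - 1 training folds, so the summed training matrix counts each sample nine times here.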