KNN prediction scoring 100% on y_test

Question

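I am trying to implement K-nearest neighbors on the Iris dataset, but after making predictions, yhat is 100% correct with no errors. Something must be wrong, and I can't figure out what it is...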

I created a column named class_id, in which I mapped the species as follows:

setosa = 1.0
versicolor = 2.0
virginica = 3.0

The column is of type float.
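The question does not show how class_id was built, so the following is only a minimal sketch of one way to do it with pandas. The source column name "class" and the label spellings ("Iris-setosa" and so on) are assumptions for illustration, not taken from the original post.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame; the "class" column
# name and the label spellings are assumptions.
df = pd.DataFrame(
    {"class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

# Map each species name to the float ID described in the question.
class_to_id = {"Iris-setosa": 1.0, "Iris-versicolor": 2.0, "Iris-virginica": 3.0}
df["class_id"] = df["class"].map(class_to_id)

# Labels missing from the mapping would become NaN, so check for them.
assert df["class_id"].notna().all()
print(df["class_id"].dtype)
```

Checking for NaN after the mapping is worthwhile because a misspelled label would otherwise silently drop out of the target column.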

Getting X and y


    x = df[['sepal length', 'sepal width', 'petal length', 'petal width']].values

type(x) shows numpy.ndarray


    y = df['class_id'].values

type(y) shows numpy.ndarray

Standardizing the data


    x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))

Creating the train and test sets


    x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)

Looking for the best K value:


    Ks = 12
    for k in range(1, Ks):
        # Fit a KNN classifier with k neighbors and score it on the test set
        neigh = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
        yhat = neigh.predict(x_test)
        score = metrics.accuracy_score(y_test, yhat)
        print('K: ', k, ' score: ', score, '\n')
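A single 30-sample test split gives a coarse accuracy estimate, since each misclassified sample moves the score by about 3.3 percentage points. As a hedged sketch rather than the asker's own method, the same scan over K can be repeated with stratified 5-fold cross-validation. This version assumes scikit-learn's bundled copy of the dataset (load_iris) instead of the question's DataFrame, and wraps the scaler in a pipeline so it is fitted on the training folds only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the question's DataFrame.
X, y = load_iris(return_X_y=True)

# Shuffled, stratified folds keep the three classes balanced in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_means = {}
for k in range(1, 12):
    # The pipeline refits the scaler inside every fold, avoiding leakage
    # of test-fold statistics into the standardization step.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_means[k] = scores.mean()
    print(f"K: {k:2d}  mean accuracy: {scores.mean():.3f}  std: {scores.std():.3f}")
```

If the cross-validated means stay high across folds, the perfect score on one split is more plausibly a property of the dataset than a bug.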

Result:

K: 1 score: 0.9666666666666667

K: 2 score: 1.0

K: 3 score: 1.0

K: 4 score: 1.0

K: 5 score: 1.0

K: 6 score: 1.0

K: 7 score: 1.0

K: 8 score: 1.0

K: 9 score: 1.0

K: 10 score: 1.0

K: 11 score: 1.0

Printing y_test and yhat with K = 5


    print(yhat)
    print(y_test)

Result:

yhat: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

y_test: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

They shouldn't all be 100% correct; something must be wrong.

Answer 1

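Try building a confusion matrix. Evaluate every example in your test data and check the specificity, sensitivity, accuracy, and precision metrics.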

Where:

TN = True Negative
FN = False Negative
FP = False Positive
TP = True Positive
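For reference, the four metrics named above have the following standard definitions in terms of these counts. For a three-class problem such as Iris, they are usually computed per class, treating that class as positive and the other two as negative.

```latex
\begin{align*}
\text{Sensitivity (recall)} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP}
\end{align*}
```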

Here you can check the difference between specificity and sensitivity: https://dzone.com/articles/ml-metrics-sensitivity-vs-specificity-difference

Here is an example of how to get a confusion matrix in Python using sklearn.
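A minimal sketch of that idea, assuming scikit-learn's bundled Iris data (load_iris) in place of the question's DataFrame and K = 5 as in the question. In the matrix returned by confusion_matrix, rows are true classes and columns are predicted classes, so a perfect classifier puts every count on the diagonal.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the question's DataFrame.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))

# Rows: true class, columns: predicted class.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class sensitivity (recall): diagonal divided by the row totals.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)

# Precision, recall, and F1 for each class in one table.
print(classification_report(y_test, y_pred))
```

Off-diagonal entries show exactly which species get confused with each other, which a single accuracy number hides.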

Also try plotting a ROC curve (optional): https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Answer 2

I found the answer in an explanation by skillsmuggler (user):

You are making use of the iris dataset. It's a well-cleaned, model dataset. The features have a strong correlation to the result, which results in the kNN model fitting the data really well. To test this, you can reduce the size of the training set, and this will result in a drop in the accuracy.
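The suggestion above can be tried with a short experiment. This is a sketch under stated assumptions, not a reproduction of the asker's setup: it uses scikit-learn's bundled Iris data (load_iris), K = 5, and an arbitrary set of training fractions chosen for illustration. No particular scores are claimed here, since they depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the question's DataFrame.
X, y = load_iris(return_X_y=True)

results = {}
# Illustrative training fractions; the rest of the data becomes the test set.
for train_size in (0.8, 0.5, 0.2, 0.1):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, stratify=y, random_state=42
    )
    # Standardize using statistics from the training split only.
    scaler = StandardScaler().fit(X_train)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(scaler.transform(X_train), y_train)
    y_pred = knn.predict(scaler.transform(X_test))
    results[train_size] = accuracy_score(y_test, y_pred)
    print(f"train fraction: {train_size:.2f}  "
          f"train samples: {len(y_train):3d}  accuracy: {results[train_size]:.3f}")
```

Stratifying the split keeps all three species represented even when only a handful of training samples remain.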

The prediction model is correct.
