K-Nearest Neighbours Error - What is X meant to represent?

Question

The following code raises an error saying that 'X' is not defined:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

A = np.array([[3.1, 2.3], [2.3, 4.2], [3.9, 3.5], [3.7, 6.4], [4.8, 1.9], 
             [8.3, 3.1], [5.2, 7.5], [4.8, 4.7], [3.5, 5.1], [4.4, 2.9],])

k = 3

test_data = [3.3, 2.9]

plt.figure()
plt.title('Input data')
plt.scatter(A[:,0], A[:,1], marker = 'o', s = 100, color = 'black')
plt.show()

knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(X)
distances, indices = knn_model.kneighbors([test_data])

print("\nK Nearest Neighbors:")
for rank, index in enumerate(indices[0][:k], start = 1):
   print(str(rank) + " is", A[index])

plt.figure()
plt.title('Nearest neighbors')
plt.scatter(A[:, 0], X[:, 1], marker = 'o', s = 100, color = 'k')
plt.scatter(A[indices][0][:][:, 0], A[indices][0][:][:, 1],
   marker = 'o', s = 250, color = 'k', facecolors = 'none')
plt.scatter(test_data[0], test_data[1],
   marker = 'x', s = 100, color = 'k')
plt.show()

However, the error goes away when I replace 'X' with 'A'. As I understand it, X is the training data. Is that correct? If so, what should I be using for X?

Answer 1

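Well, X should be the independent variable(s) and y the dependent variable. In your code the only data array is A, so A is your X, and passing A to fit() is the right fix.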

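In research, a variable is any characteristic that can take on different values, such as height, age, species, or exam score.

In scientific research, we often want to study the effect of one variable on another. For example, you might want to test whether students who spend more time studying get better exam scores.

In a study of a cause-and-effect relationship, the variables are called the independent and dependent variables.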

The independent variable is the cause. Its value is independent of other variables in your study.
The dependent variable is the effect. Its value depends on changes in the independent variable.
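
As a minimal sketch of that convention in scikit-learn, the snippet below uses made-up data for the studying example: `hours_studied` and `passed` are hypothetical names and values chosen only for illustration. X is a 2-D array with one row per sample and one column per feature, and y holds one label per row.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up illustrative data, not taken from any real study.
# X: the independent variable (hours studied), shape (n_samples, n_features).
hours_studied = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
# y: the dependent variable (1 = passed, 0 = failed), one label per row of X.
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X, y = hours_studied, passed

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)  # supervised estimators take both X and y

# Predict for two new students who studied 1.5 and 8.5 hours.
predictions = clf.predict([[1.5], [8.5]])
print(predictions)
```

The names X and y are only a convention; any variable names work as long as the feature array is passed first and the labels second.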

Here is some generic code that illustrates this on a full dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
#%matplotlib inline

# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"C:\your_path_here\classified_data.csv", index_col=0)
df.head()

df.info()

df.describe()

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))


df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
df_feat.head()


from sklearn.model_selection import train_test_split
# X holds the features (independent variables), y the labels (dependent variable)
X = df_feat
y = df['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=101)


from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)


# Notebook output of the fit call above (the estimator's repr):
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#            metric_params=None, n_jobs=1, n_neighbors=1, p=2,
#            weights='uniform')


pred = knn.predict(X_test)


from sklearn.metrics import classification_report,confusion_matrix
conf_mat=confusion_matrix(y_test,pred)
print(conf_mat)


print(classification_report(y_test,pred))

Result:

              precision    recall  f1-score   support

           0       0.88      0.90      0.89       250
           1       0.90      0.87      0.89       250

    accuracy                           0.89       500
   macro avg       0.89      0.89      0.89       500
weighted avg       0.89      0.89      0.89       500

You can download the sample data from this link:

https://www.kaggle.com/shubh247/simple-way-handle-classified-data-using-knn?select=Classified+Data
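
Coming back to the code in the question, here is a minimal sketch of the same idea with the plotting left out. It assumes the array A is the only data you have, so A plays the role of X. NearestNeighbors is unsupervised, so there is no y to pass to fit().

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Same data, k and query point as in the question.
A = np.array([[3.1, 2.3], [2.3, 4.2], [3.9, 3.5], [3.7, 6.4], [4.8, 1.9],
              [8.3, 3.1], [5.2, 7.5], [4.8, 4.7], [3.5, 5.1], [4.4, 2.9]])
k = 3
test_data = [3.3, 2.9]

# X is just the conventional name for the data matrix; here it is A.
X = A

# fit() only indexes the points so that neighbours can be looked up later.
knn_model = NearestNeighbors(n_neighbors=k, algorithm='auto').fit(X)

# kneighbors() expects a 2-D input, hence the extra list around test_data.
distances, indices = knn_model.kneighbors([test_data])

print("K Nearest Neighbors:")
for rank, index in enumerate(indices[0], start=1):
    print(f"{rank} is {X[index]}")
```

Writing `X = A` is optional. Passing A directly to fit(), as you already did, is equivalent.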

python

cluster-analysis

machine-learning

scikit-learn