Using k-nearest neighbour without splitting into training and test sets

Question

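I have the following dataset, with over 20,000 rows:

I want to use columns A through E to predict column X with a k-nearest neighbours algorithm. I tried sklearn's KNeighborsRegressor, as follows: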

import pandas as pd
import random
from numpy.random import permutation
import math
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("data.csv")

random_indices = permutation(df.index)
test_cutoff = int(math.floor(len(df)/5))
test = df.loc[random_indices[:test_cutoff]]  # start at 0 so the first shuffled row is not dropped
train = df.loc[random_indices[test_cutoff:]]

x_columns = ['A', 'B', 'C', 'D', 'E']
y_column = ['X']

knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(train[x_columns], train[y_column])
predictions = knn.predict(test[x_columns])

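This only makes predictions for the test data, which is a fifth of the original dataset. I also want predictions for the training data.

To do this, I tried to implement my own k-nearest algorithm by calculating the Euclidean distance for each row from every other row, finding the k shortest distances, and averaging the X value from those k rows. This process took over 30 seconds for just one row, and I have over 20,000 rows. Is there a quicker way to do this?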

Answer 1

To do this, I tried to implement my own k-nearest algorithm by calculating the Euclidean distance for each row from every other row, finding the k shortest distances, and averaging the X value from those k rows. This process took over 30 seconds for just one row, and I have over 20,000 rows. Is there a quicker way to do this?

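Yes. The problem is that loops in Python are very slow. What you can do is vectorize your computation. Suppose your data is in a matrix X (n x d); then the matrix of squared distances, D_ij = || X_i - X_j ||^2, is

D_ij = || X_i ||^2 + || X_j ||^2 - 2 X_i . X_j

So in Python: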

D = (X ** 2).sum(1).reshape(-1, 1) + (X ** 2).sum(1).reshape(1, -1) - 2*X.dot(X.T)
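
To get from the distance matrix to actual predictions, something along these lines might work. This is only a sketch: the function name `knn_predict_all`, the synthetic data and `k=10` are my own assumptions for illustration, not part of the question. It excludes each row from its own neighbour list and takes the plain mean of the k nearest targets, as the question describes.

```python
import numpy as np

def knn_predict_all(X, y, k):
    """Predict every row from its k nearest *other* rows (unweighted mean)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Squared Euclidean distances between all pairs of rows, no Python loop
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # A row must not count as its own neighbour
    np.fill_diagonal(D, np.inf)
    # Indices of the k smallest distances per row (order within the k is arbitrary)
    nearest = np.argpartition(D, k - 1, axis=1)[:, :k]
    return y[nearest].mean(axis=1)

# Synthetic stand-in for columns A..E and X (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X.sum(axis=1)

pred = knn_predict_all(X, y, k=10)
print(pred.shape)
```

One caveat: the full matrix for 20,000 rows holds 20,000 x 20,000 float64 values, which is about 3.2 GB, so you may need to compute it in blocks of rows rather than all at once.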

Answer 2

Try this code:

import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("data.csv")
X = np.asarray(df.loc[:, ['A', 'B', 'C', 'D', 'E']])
y = np.asarray(df['X'])

rs = ShuffleSplit(n_splits=1, test_size=1./5, random_state=0)
train_indices, test_indices = next(rs.split(X))  # built-in next(); the .next() method is Python 2 only

knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(X[train_indices], y[train_indices])

predictions = knn.predict(X)

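The main difference from your solution is the use of ShuffleSplit.

Notes:

predictions contains the predicted values for all of the data (test and training).
The proportion of test data can be adjusted with the test_size parameter (I used your setting, i.e. one fifth).
next() must be called on the iterator to produce the indices of the training and test data.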
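
If it helps, here is a rough sketch of how the combined predictions could be pulled apart again using the same index arrays. The synthetic data and the mean absolute error comparison are assumptions of mine, added only so the snippet runs on its own without data.csv.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for columns A..E and X (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=1000)

rs = ShuffleSplit(n_splits=1, test_size=1./5, random_state=0)
train_indices, test_indices = next(rs.split(X))

knn = KNeighborsRegressor(n_neighbors=100, weights='distance')
knn.fit(X[train_indices], y[train_indices])
predictions = knn.predict(X)

# The index arrays select each subset's predictions from the full array
train_predictions = predictions[train_indices]
test_predictions = predictions[test_indices]

# Mean absolute error per subset; the training rows were seen during fitting,
# so their error is expected to be much lower than the test error
train_mae = np.mean(np.abs(train_predictions - y[train_indices]))
test_mae = np.mean(np.abs(test_predictions - y[test_indices]))
print(train_mae, test_mae)
```

Only the error on the test rows says anything about how well the model generalizes.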

Answer 3

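If you only want predictions for the training data, there is no need to split the data into training and test sets.

You can simply fit on the original data and then predict on it.

model.fit(original_data, target)
predictions = model.predict(original_data)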
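
One thing to watch for when predicting on the same rows the model was fitted on: each row is its own nearest neighbour at distance zero, and with weights='distance' scikit-learn, as far as I know, gives such a neighbour all of the weight, so the prediction simply reproduces the row's own target. A possible way around this, sketched below on synthetic data with an assumed k of 10, is to call kneighbors() with no argument, which is documented to return the neighbours of each fitted row without counting the row itself, and then average the neighbours' targets by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for columns A..E and X (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=500)

model = KNeighborsRegressor(n_neighbors=10, weights='distance')
model.fit(X, y)

# Fit on everything, predict on everything: each row finds itself first
in_sample = model.predict(X)

# With no argument, kneighbors() looks up the fitted rows and excludes
# each row from its own neighbour list
dist, idx = model.kneighbors()

# Inverse-distance weighted mean of the neighbours' targets;
# the floor guards against division by zero for duplicate rows
weights = 1.0 / np.maximum(dist, 1e-12)
leave_one_out = (weights * y[idx]).sum(axis=1) / weights.sum(axis=1)

print(np.abs(in_sample - y).max(), np.abs(leave_one_out - y).mean())
```

The second set of predictions is the one that actually reflects what the neighbours say about each row.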
