为什么 sklearn kNN 分类器在我的训练样本和测试样本数量很大的情况下运行得如此之快

Why sklearn's kNN classifer runs so fast while the number of my training samples and test samples are large

根据我的理解,对于每个测试样本,kNN分类器算法计算当前测试样本与所有训练样本的距离和select一定数量的最近邻,并确定标签测试样品,然后下一个测试样品将完成。

我的代码类似于以下超链接中的示例 kNN 分类器代码,非常简单:

https://tutorialspoint.dev/computer-science/machine-learning/multiclass-classification-using-scikit-learn

我的训练样本数是8000,测试样本数是1500,样本维度是12。

我运行sklearn kNN分类器代码时,只用了2秒,准确率不错

怀疑sklearn的kNN算法耗时,写了个简单的代码计算测试样本和所有训练样本的距离,发现很耗时,连排序都不算算法。计算距离的代码如下:

for i in range(X_test.shape[0]):
    for j in range(X_train.shape[0]):
        ## calculate distances between a test sample and all train samples
        Distance[j,0] = (X_test.iloc[i,0]-X_train.iloc[j,0])*(X_test.iloc[i,0]-X_train.iloc[j,0]) + \
                   (X_test.iloc[i,1]-X_train.iloc[j,1])*(X_test.iloc[i,1]-X_train.iloc[j,1]) + \
                   (X_test.iloc[i,2]-X_train.iloc[j,2])*(X_test.iloc[i,2]-X_train.iloc[j,2]) + \
                   (X_test.iloc[i,3]-X_train.iloc[j,3])*(X_test.iloc[i,3]-X_train.iloc[j,3]) + \
                   (X_test.iloc[i,4]-X_train.iloc[j,4])*(X_test.iloc[i,4]-X_train.iloc[j,4]) + \
                   (X_test.iloc[i,5]-X_train.iloc[j,5])*(X_test.iloc[i,5]-X_train.iloc[j,5]) + \
                   (X_test.iloc[i,6]-X_train.iloc[j,6])*(X_test.iloc[i,6]-X_train.iloc[j,6]) + \
                   (X_test.iloc[i,7]-X_train.iloc[j,7])*(X_test.iloc[i,7]-X_train.iloc[j,7]) + \
                   (X_test.iloc[i,8]-X_train.iloc[j,8])*(X_test.iloc[i,8]-X_train.iloc[j,8]) + \
                   (X_test.iloc[i,9]-X_train.iloc[j,9])*(X_test.iloc[i,9]-X_train.iloc[j,9]) + \
                   (X_test.iloc[i,10]-X_train.iloc[j,10])*(X_test.iloc[i,10]-X_train.iloc[j,10]) + \
                   (X_test.iloc[i,11]-X_train.iloc[j,11])*(X_test.iloc[i,11]-X_train.iloc[j,11])

我不确定sklearn是否使用整个训练数据集来计算k个最近邻居。如果是,sklearn使用什么优化算法?

提前致谢。

你说得对,最近邻搜索真的很耗时。如果您天真地这样做,您的运行时间为 O(n^2)。好消息是,sklearn 使用一些聪明的算法来规避所有距离的计算。

看看docs and you will see that one parameter is the algorithm used for nearest-neighbor search, e.g. BallTree。这些算法大大加快了计算速度。

另一方面,您的代码效率有点低。您可以这样做,而不是手动计算每个维度:

((X_test.iloc[i,:] - X_train.iloc[j,:]) ** 2).sum()

这利用了 pandas' 向量化函数,使其速度更快。