NearestNeighbors 权衡 - 运行更快，但结果不太准确

Question

我正在处理中型数据集 (shape=(14013L, 46L))。我想用 knn 平滑每个样本。我正在训练我的模型：

NearestNeighbors(n_neighbors, algorithm='ball_tree',    
    metric=sklearn.metrics.pairwise.cosine_distances)

平滑如下：

def smooth(x,nbrs,data,alpha):
    """
    input:
        alpha: the smoothing factor
        nbrs: trained NearestNeighbors from sklearn
        data: the original data 
             (since NearestNeighbors returns only the index and not the samples)
        x:    what we want to smooth
    output:
        smoothed x with its nearest neighbours
    """
    distances, indices = nbrs.kneighbors(x)
    distances = map(lambda z:abs(-z+1),distances)[0]
    norm = sum(distances)
    if norm == 0:
        "No neighbours were found."
        return x
    distances = map(lambda z: (1-alpha)*z/norm ,distances)
    indices = map(lambda z: data[z],indices)[0]
    other = np.array([indices[i] * distances[i] for i in range(len(distances))])
    z = x * alpha
    z = z.reshape((1,z.shape[0]))
    smoothed = sum(np.concatenate((other,z),axis=0))
    return smoothed

我的问题：

怎么可能没有找到邻居？（我在我的数据集上经历过，因此 if 条件）
拟合 ("training") 需要 18 秒，但平滑 ~1000 个样本需要 20 多分钟。如果平滑过程会更短，我愿意得到不太准确的结果。可能吗？我应该更改哪个 parameters 才能实现它？

Answer 1

没有你的数据集和完整的代码，很难确定。这是我的想法。

distances = map(lambda z:abs(-z+1),distances)[0]
norm = sum(distances)

因为您索引了地图的结果，所以您应该只获取第一个邻居。如果 x 实际上是您用来训练的点之一，那么第一个最近的邻居将是...x。因为您使用的是余弦距离，所以该距离正好是：1. abs(1-1) == 0。在我建议更正之前，让我们谈谈性能。

关于性能：您到处都在使用 map 函数，这是一个 Python 内置函数。 Scikit-learn 建立在 numpy 之上，其设计目的是比原生 Python 代码更快地进行数学计算。这就是为什么训练比您的代码快得多的原因。您应该使用 numpy 算法而不是 map。一个例子：这一行

distances = map(lambda z: (1-alpha)*z/norm ,distances)

应该是这个

distances *= ((1-alpha)/norm)

如果 norm 是一个数组，它应该具有正确的维度以便 numpy 广播规则启动并 "the right thing" 完成，完全在 C 中。

好的，既然我建议使用 numpy 数组（而不是 map 和 Python 列表），我相信 if 语句之前的两行是正确的是删除索引。（您的代码的其余部分也可能被索引破坏；在函数的第二行之后，distances 不是数组或列表，而是标量。 )

distances = np.abs( distances-1 )
norm = np.sum(distances)

此外，您不应多次调用 smooth() 函数，每个样本调用一次。您应该在 N_samples by N_dimensions==46 numpy 数组上调用它，并适当调整您的 smooth() 代码。（例如，NearestNeighbors return N_samples by N_neighbors 数组比 returning N_samples 长度为 N_samples 的单个数组快得多。 )

Answer 2

首先，为什么要用球树？也许您的指标确实向您暗示了这一点，但如果不是这种情况，您也可以使用 kd-tree。

我将从理论的角度来回答你的问题。 radius 参数默认设置为 1.0。这对于您的数据集来说可能太小了，因为如果我理解正确，这将指定要查看的查询的半径。所以，我建议运行你的代码增加这个值，直到你得到一些结果。然后减少它，直到没有结果。再次增加一点，您就拥有了适合您的数据集的最佳参数。

一个重要的参数是leaf_size，它实际上影响了查询到达时你要检查多少个点。此参数的较小值可能会产生更快的执行速度，但精度较低。

您可能还想查看我的 this 答案，它解释了速度和准确性之间的权衡，这种权衡在进行最近邻搜索时至关重要。

NearestNeighbors 权衡 - 运行更快，但结果不太准确

NearestNeighbors tradeoff - run faster with less accurate results

python

smooth