Efficient k-nearest neighbors of all pairs in a pandas DataFrame

Question

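I have a pandas DataFrame with 20K rows and 50 columns. I want to find the 5 nearest neighbors of each row in this DataFrame, based on the Euclidean distance over the columns. The result would be a 20K × 5 matrix whose columns hold the IDs of the nearest neighbors in the DataFrame.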

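I am looking for a solution that does this as efficiently as possible, preferably using indexing, parallel operations, or vectorized operations offered by pandas. SciPy's kd-tree is slow.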

Any ideas?

Answer 1

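SciPy's kd-tree is indeed slow for your case: querying a single point takes about 80 ms, so I'd guess the total computation time for the whole dataset would be around 0.08 * 20_000 = 1600 seconds.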
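The 80 ms per-query figure depends on hardware and on the data, so it may be worth measuring on your own machine. The following is a minimal sketch of that measurement, assuming random data of the same shape as in the question; `estimate_kdtree_seconds` is a hypothetical helper name, not part of SciPy, and the linear extrapolation is only a rough guess.

```python
import time

import numpy as np
from scipy.spatial import KDTree


def estimate_kdtree_seconds(arr, k=5, sample=10):
    """Time k-NN queries for `sample` rows and extrapolate linearly to all rows.

    Hypothetical helper for illustration only.
    """
    tree = KDTree(arr)  # build the index once
    start = time.perf_counter()
    for row in arr[:sample]:
        tree.query(row, k=k)  # one query per point, as in the estimate above
    per_query = (time.perf_counter() - start) / sample
    return per_query * len(arr)


# Random stand-in for the real data: 20,000 rows, 50 columns
arr = np.random.rand(20_000, 50) * 20
estimated = estimate_kdtree_seconds(arr)
print(f"estimated total for all rows: {estimated:.0f} s")
```

The number printed will differ from machine to machine; only the order of magnitude is meaningful.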

Another option for high-dimensional data (such as a dataset with 50 columns) could be the Ball Tree data structure. As the linked page says:

Because of the spherical geometry of the ball tree nodes, it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure of the training data.

Try the following code:

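from sklearn.neighbors import NearestNeighbors
import numpy as np

# Random stand-in for the real data: 20,000 rows, 50 columns
arr = np.random.rand(20_000, 50) * 20
# Build a ball tree index that returns the 5 nearest neighbors per query
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(arr)

# %timeit is an IPython/Jupyter magic; time queries for 10, 100 and 1000 rows
%timeit nbrs.kneighbors(arr[:10, :])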
# 24.6 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nbrs.kneighbors(arr[:100, :])
# 209 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nbrs.kneighbors(arr[:1000, :])
# 2.02 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

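Looking at these %timeit results, the query time appears to scale roughly linearly with the number of query rows (about 2 ms per row), so for 20k rows you could expect around 20_000 / 1_000 * 2 = ~40 s. That is much faster than the ~1600 seconds you would most likely get from the kd-tree data structure.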
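To get from these timings to the 20K × 5 matrix of neighbor IDs asked for in the question, something like the following sketch could work. It assumes the DataFrame holds only numeric columns; `k_nearest_ids` and the `nn_1` ... `nn_5` column names are hypothetical and chosen for illustration. It relies on the fact that calling `kneighbors()` without a query array returns the neighbors of the fitted points and does not count a point as its own neighbor.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def k_nearest_ids(df, k=5, algorithm="ball_tree", n_jobs=None):
    """Return a (len(df), k) DataFrame with the index labels of each row's
    k nearest rows under Euclidean distance. Hypothetical helper."""
    values = df.to_numpy(dtype=float)
    nbrs = NearestNeighbors(n_neighbors=k, algorithm=algorithm, n_jobs=n_jobs)
    nbrs.fit(values)
    # No X argument: query the fitted points themselves, excluding each
    # point from its own neighbor list.
    idx = nbrs.kneighbors(return_distance=False)
    # Map positional indices back to the DataFrame's index labels
    ids = df.index.to_numpy()[idx]
    columns = [f"nn_{i + 1}" for i in range(k)]
    return pd.DataFrame(ids, index=df.index, columns=columns)


# Small random example; the real data would have 20,000 rows
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 50)) * 20)
neighbors = k_nearest_ids(df, k=5)
print(neighbors.head())
```

Passing `n_jobs=-1` asks scikit-learn to run the neighbor search on all available cores, which may shorten the query time further; whether it helps in practice depends on the machine and the data.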

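Finally, I definitely recommend reading the nearest neighbors page thoroughly, so that you fully understand all the intricacies of the algorithms it describes.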

python

nearest-neighbor

pandas