What is the most efficient way to store a set of points (embeddings) such that queries for closest points are computed quickly

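Given a set of embeddings, i.e. a set of [name, vector representation] pairs, how should I store it so that queries for the closest points are computed quickly? For example, given 100 embeddings in 2-d space, if I query the data structure for the 5 points closest to (10,12), it returns { [a,(9,11.5)] , [b,(12,14)],...}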

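The naive approach is to compute all distances, sort them, and return the top k points. Alternatively, one might think of storing the points in a 2-d array, in blocks/units of m×n space, covering the range of the embedding space. I don't think this extends to higher dimensions, but I'm willing to be corrected.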
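For reference, the naive approach can be sketched in a few lines of Python. This is only an illustration: the function name `k_nearest` and the sample points are made up, and Euclidean distance is assumed as the metric. Using a heap instead of a full sort avoids ordering all n points when only k are needed.

```python
import heapq
import math

def k_nearest(embeddings, query, k):
    """Return the k (name, vector) pairs closest to `query` (Euclidean distance).

    `embeddings` is a list of (name, vector) tuples. Every distance is computed,
    so the cost is O(n * d) per query, plus O(n log k) for the selection.
    """
    return heapq.nsmallest(k, embeddings, key=lambda item: math.dist(item[1], query))

# Hypothetical sample data, loosely following the example in the question.
points = [
    ("a", (9.0, 11.5)),
    ("b", (12.0, 14.0)),
    ("c", (0.0, 0.0)),
    ("d", (30.0, 5.0)),
]
nearest = k_nearest(points, (10.0, 12.0), 2)
```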

There are standard approximate nearest neighbor libraries, such as faiss, flann, java-lsh, etc. (which are either LSH or Product Quantization based), which you may use.

The fastest solution (which I found useful) is to transform a vector (of, say, 100 dimensions) into a long variable (64 bits) by using the Johnson–Lindenstrauss transform. You can then use Hamming similarity (i.e. 64 minus the number of bits set in a XOR b) to compute the similarity between bit vectors a and b. You could use the POPCOUNT machine instruction for this (which is very fast).

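In effect, if you use POPCOUNT in C, even a complete iteration over the whole set of binary transformed vectors (long variables of 64 bits) will still be very fast.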
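A rough sketch of this idea in Python follows. The answer does not spell out how the 64 bits are produced; this sketch assumes sign random projection (one bit per random Gaussian direction), which is one common way to binarize a Johnson–Lindenstrauss style projection and which approximates angular similarity rather than Euclidean distance. All function names are hypothetical, and `bin(x).count("1")` stands in for the POPCOUNT instruction you would use in C.

```python
import random

BITS = 64  # signature width, matching one 64-bit long

def make_projections(dim, bits=BITS, seed=0):
    """Draw `bits` random Gaussian directions of length `dim` (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def signature(vector, projections):
    """Pack the signs of the projections of `vector` into one integer."""
    sig = 0
    for i, row in enumerate(projections):
        if sum(w * x for w, x in zip(row, vector)) >= 0.0:
            sig |= 1 << i
    return sig

def hamming_similarity(a, b, bits=BITS):
    """Number of matching bits: bits minus the popcount of a XOR b."""
    return bits - bin(a ^ b).count("1")

def top_k(index, query_sig, k):
    """Full scan over (name, signature) pairs, most similar first."""
    ranked = sorted(index, key=lambda item: hamming_similarity(item[1], query_sig),
                    reverse=True)
    return ranked[:k]

# Hypothetical usage with 100-dimensional vectors.
projections = make_projections(dim=100)
data_rng = random.Random(1)
vectors = {name: [data_rng.gauss(0.0, 1.0) for _ in range(100)] for name in "abcde"}
index = [(name, signature(vec, projections)) for name, vec in vectors.items()]
best = top_k(index, signature(vectors["c"], projections), 3)
```

The signatures are computed once when the data is stored; each query then costs one XOR and one popcount per stored point, which is why a full scan stays cheap.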

information-retrieval

embedding

data-retrieval

word-embedding