有人知道如何使用 Spark MLlib 提供的近似最近邻搜索吗?

Does anybody know how to use the Approximate Nearest Neighbor Search provide by Spark MLlib?

我想使用 Spark MLlib (ref.) 提供的近似最近邻搜索,但我非常迷茫,因为我没有找到示例或其他东西来指导我。为前一个 link 提供的唯一信息是:

Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and it approximately returns a specified number of rows in the dataset that are closest to the vector.

Approximate nearest neighbor search accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.

A distance column will be added to the output dataset to show the true distance between each output row and the searched key.

Note: Approximate nearest neighbor search will return fewer than k rows when there are not enough candidates in the hash bucket.

有人知道如何使用 Spark MLlib 提供的近似最近邻搜索吗?

在这里你可以找到一个例子https://spark.apache.org/docs/2.1.0/ml-features.html#lsh-algorithms :

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0))
)).toDF("id", "keys")

val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0)),
  (6, Vectors.dense(0.0, 1.0)),
  (7, Vectors.dense(0.0, -1.0))
)).toDF("id", "keys")

val key = Vectors.dense(1.0, 0.0)

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("keys")
  .setOutputCol("values")

val model = brp.fit(dfA)

// Feature Transformation
model.transform(dfA).show()
// Cache the transformed columns
val transformedA = model.transform(dfA).cache()
val transformedB = model.transform(dfB).cache()

// Approximate similarity join
model.approxSimilarityJoin(dfA, dfB, 1.5).show()
model.approxSimilarityJoin(transformedA, transformedB, 1.5).show()
// Self Join
model.approxSimilarityJoin(dfA, dfA, 2.5).filter("datasetA.id < datasetB.id").show()

// Approximate nearest neighbor search
model.approxNearestNeighbors(dfA, key, 2).show()
model.approxNearestNeighbors(transformedA, key, 2).show()

以上代码来自spark文档。