Euclidean distance in Spark 2.1

I am trying to compute the Euclidean distance between two vectors. I have the following DataFrame:

root
 |-- h: string (nullable = true)
 |-- id: string (nullable = true)
 |-- sid: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- episodeFeatures: vector (nullable = true)

import org.apache.spark.mllib.util.{MLUtils}
val jP2 = jP.withColumn("dist", MLUtils.fastSquaredDistance("features", 5, "episodeFeatures", 5)) 

I get an error like this:

error: method fastSquaredDistance in object MLUtils cannot be accessed in object org.apache.spark.mllib.util.MLUtils

Is there any way to access that private method?

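MLUtils.fastSquaredDistance is internal (it is declared private[mllib]), and even if it weren't, it couldn't be used on Columns or (guessing from the version) on ml vectors. You have to write your own udf: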

import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vector

val euclidean = udf((v1: Vector, v2: Vector) => ???)  // Fill with preferred logic

val jP2 = jP.withColumn("dist", euclidean($"features", $"episodeFeatures"))
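
One way to fill in the ??? is sketched below. It is only an illustration, not the only reasonable choice: it assumes both columns hold org.apache.spark.ml.linalg vectors of equal size and relies on the public Vectors.sqdist, which returns the squared distance. The helper name euclideanDistance and the null handling are assumptions added for this example (the schema marks both columns as nullable).

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Hypothetical helper: Euclidean distance as the square root of the
// squared distance computed by the public Vectors.sqdist.
def euclideanDistance(v1: Vector, v2: Vector): Double =
  math.sqrt(Vectors.sqdist(v1, v2))

// Wrap the helper in a udf. Returning an Option lets a null input
// produce a null "dist" value instead of a NullPointerException.
val euclidean = udf { (v1: Vector, v2: Vector) =>
  if (v1 == null || v2 == null) None
  else Some(euclideanDistance(v1, v2))
}
```

With this definition the withColumn call shown above should work unchanged. Keeping the arithmetic in a plain function also makes it easy to check outside Spark SQL, for example on Vectors.dense(0.0, 0.0) and Vectors.dense(3.0, 4.0).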