Add a column to DataFrame with value of 1 where prediction greater than a custom threshold

Question

I am trying to add a column to a DataFrame that should have the value 1 when the output class probability is high. Something like this:

val output = predictions
    .withColumn(
        "easy", 
        when( $"label" === $"prediction" && 
              $"probability" > 0.95, 1).otherwise(0)
    )

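The problem is that probability is a Vector and 0.95 is a Double, so the above does not work. What I really need is something like max($"probability") > 0.95, but of course that does not work either, because max is an aggregate over rows rather than over the elements of a Vector.

What is the right way to accomplish this?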

Answer 1

Define a UDF:

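import org.apache.spark.sql.functions.udf

// Replace each <type> with the actual type of the corresponding column.
val findP = udf((label: <type>, prediction: <type>, probability: <type>) => {
  if (label == prediction && probability.toArray.max > 0.95) 1 else 0
})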

Use the UDF in withColumn():

val output = predictions.withColumn("easy", findP($"label", $"prediction", $"probability"))
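As a minimal sketch of how the placeholders might be filled in, the version below assumes the columns come from a spark.ml classifier, so that label and prediction are Double and probability is an org.apache.spark.ml.linalg.Vector. The object name EasyColumnSketch, the helper isEasy, and the small predictions DataFrame are made up for illustration; check your own schema with printSchema() before copying the types.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object EasyColumnSketch {
  // Hypothetical helper: 1 when the prediction is correct and the largest
  // class probability is strictly above the threshold, otherwise 0.
  def isEasy(label: Double, prediction: Double, probability: Vector, threshold: Double): Int =
    if (label == prediction && probability.toArray.max > threshold) 1 else 0

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("easy-sketch").getOrCreate()
    import spark.implicits._

    // Made-up stand-in for the asker's `predictions` DataFrame (assumed schema).
    val predictions = Seq(
      (1.0, 1.0, Vectors.dense(0.02, 0.98)),
      (0.0, 1.0, Vectors.dense(0.03, 0.97)),
      (0.0, 0.0, Vectors.dense(0.60, 0.40))
    ).toDF("label", "prediction", "probability")

    // Wrap the helper in a UDF so it can be applied to Columns.
    val findP = udf((label: Double, prediction: Double, probability: Vector) =>
      isEasy(label, prediction, probability, 0.95))

    predictions.withColumn("easy", findP($"label", $"prediction", $"probability")).show()
    spark.stop()
  }
}
```

Keeping the rule in a plain function makes it possible to check the threshold logic without starting Spark.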

Answer 2

Use a udf, for example:

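import org.apache.spark.ml.linalg.Vector  // assumes probability comes from spark.ml
import org.apache.spark.sql.functions.udf
// Wrap the function in udf(...) so it can be applied to Columns;
// adjust the label and prediction types to match your schema.
val func = udf((label: String, prediction: String, vector: Vector) => {
  if (label == prediction && vector.toArray.max > 0.95) 1 else 0
})
val output = predictions
  .select($"label", func($"label", $"prediction", $"probability").as("easy"))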
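If you are on Spark 3.0 or later, a UDF may not be needed at all. The sketch below is a different technique from the one in this answer: it converts the ML vector with vector_to_array and takes the per-row maximum with array_max, which is close to the max($"probability") > 0.95 the question asked for. It assumes probability is an org.apache.spark.ml.linalg.Vector column; the names EasyWithoutUdf and addEasy and the sample rows are hypothetical.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_max, col, when}

object EasyWithoutUdf {
  // Adds "easy" = 1 where label equals prediction and the largest
  // class probability is strictly above `threshold`; otherwise 0.
  def addEasy(predictions: DataFrame, threshold: Double): DataFrame = {
    // vector_to_array (Spark 3.0+) turns the ML vector into array<double>,
    // and array_max picks the largest element of that array per row.
    val maxProb = array_max(vector_to_array(col("probability")))
    predictions.withColumn(
      "easy",
      when(col("label") === col("prediction") && maxProb > threshold, 1).otherwise(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("easy-no-udf").getOrCreate()
    import spark.implicits._

    // Made-up sample data with the assumed schema.
    val predictions = Seq(
      (1.0, 1.0, Vectors.dense(0.02, 0.98)),
      (0.0, 1.0, Vectors.dense(0.03, 0.97)),
      (0.0, 0.0, Vectors.dense(0.60, 0.40))
    ).toDF("label", "prediction", "probability")

    addEasy(predictions, 0.95).show()
    spark.stop()
  }
}
```

Built-in column functions stay visible to the query optimizer, whereas a UDF is opaque to it, so this form is usually worth trying first when your Spark version has it.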

Answer 3

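Here is a simple example that does what you are asking. Create a udf that takes the probability column and returns 0 or 1 for the newly added column. Inside a Row, an array column is passed as a WrappedArray rather than an Array or a Vector, so that is the parameter type the udf uses.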

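  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.udf
  import scala.collection.mutable

  val spark = SparkSession.builder().master("local").getOrCreate()

  import spark.implicits._

  // Scala Vectors are stored as array<double> columns.
  val data = spark.sparkContext.parallelize(Seq(
    (Vector(0.78, 0.98, 0.97), 1), (Vector(0.78, 0.96), 2), (Vector(0.78, 0.50), 3)
  )).toDF("probability", "id")

  // Returns 1 when the largest probability is at least 0.95, otherwise 0.
  def label = udf((prob: mutable.WrappedArray[Double]) => {
    if (prob.max >= 0.95) 1 else 0
  })

  data.withColumn("label", label($"probability")).show()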

Output:

+------------------+---+-----+
|       probability| id|label|
+------------------+---+-----+
|[0.78, 0.98, 0.97]|  1|    1|
|      [0.78, 0.96]|  2|    1|
|       [0.78, 0.5]|  3|    0|
+------------------+---+-----+
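The rule applied in this answer can also be checked without Spark. The sketch below uses a hypothetical ThresholdCheck object with a label helper that takes a plain Seq[Double]; the empty-sequence guard is an addition of this sketch, since calling max on an empty collection throws.

```scala
object ThresholdCheck {
  // Hypothetical helper: 1 if the largest probability is at least `threshold`, else 0.
  // An empty sequence yields 0 instead of throwing, because `max` fails on empty input.
  def label(prob: Seq[Double], threshold: Double = 0.95): Int =
    if (prob.nonEmpty && prob.max >= threshold) 1 else 0

  def main(args: Array[String]): Unit = {
    // The three rows from the table above, plus an empty row for the guard.
    val rows = Seq(
      Seq(0.78, 0.98, 0.97),
      Seq(0.78, 0.96),
      Seq(0.78, 0.50),
      Seq.empty[Double])
    rows.foreach { r =>
      val shown = r.mkString("[", ", ", "]")
      println(shown + " -> " + label(r))
    }
  }
}
```

Note that this answer compares with >= while the question uses >, so a row whose maximum is exactly 0.95 is labelled differently depending on which operator you pick.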
