Scala - How to split the probability column (column of vectors) that we obtain when we fit the GMM model to the data into two separate columns?

I am trying to do the following:

+-----+-------------------------+----------+-------------------------------------------+
|label|features                 |prediction|probability                                |
+-----+-------------------------+----------+-------------------------------------------+
|0.0  |(3,[],[])                |0         |[0.9999999999999979,2.093996169658831E-15] |
|1.0  |(3,[0,1,2],[0.1,0.1,0.1])|0         |[0.999999999999999,9.891337521299582E-16]  |
|2.0  |(3,[0,1,2],[0.2,0.2,0.2])|0         |[0.9999999999999979,2.0939961696578572E-15]|
|3.0  |(3,[0,1,2],[9.0,9.0,9.0])|1         |[2.093996169659668E-15,0.9999999999999979] |
|4.0  |(3,[0,1,2],[9.1,9.1,9.1])|1         |[9.89133752128275E-16,0.999999999999999]   |
|5.0  |(3,[0,1,2],[9.2,9.2,9.2])|1         |[2.0939961696605603E-15,0.9999999999999979]|
+-----+-------------------------+----------+-------------------------------------------+

Convert the dataframe above to have two more columns, prob1 and prob2, each holding the corresponding value shown in the probability column.

I found similar questions, one in PySpark and one in Scala. I don't know how to translate the PySpark code, and the Scala code gives me an error.

PySpark code:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())

output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))

Or to append these columns to the original dataframe:

randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))

Scala code:

import org.apache.spark.sql.functions.udf

val getPOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1))
model.transform(testDf).select(getPOne($"probability"))

When I run the Scala code I get the following error:

scala> predictions.select(getPOne(col("probability"))).show(false)
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)' due to data type mismatch: argument 1 requires vector type, however, '`probability`' is of vector type.;;
'Project [UDF(probability#39) AS UDF(probability)#135]
+- Project [label#0, features#1, prediction#34, UDF(features#1) AS probability#39]
   +- Project [label#0, features#1, UDF(features#1) AS prediction#34]
      +- Relation[label#0,features#1] libsvm

I am currently using Scala 2.11.11 and Spark 2.1.1.

I understand from your question that you are trying to split the probability column into two columns, prob1 and prob2. If so, a simple array access with withColumn should solve your problem.

predictions
  .withColumn("prob1", $"probability"(0))
  .withColumn("prob2", $"probability"(1))
  .drop("probability")

You can find more functions in the Spark SQL functions documentation that may help you when working with dataframes in the future.
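
For instance, the same split can also be written with select and aliases, mirroring the PySpark snippet from the question. A minimal sketch, assuming the array-typed probability column used in this example ($"probability"(0) is shorthand for getItem(0)):

predictions.select(
  $"probability"(0).alias("prob1"), // element 0 of the array column
  $"probability"(1).alias("prob2")  // element 1 of the array column
)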

Edited

I created a temporary dataframe to match your column, as follows:

import spark.implicits._ // needed for toDF on a local Seq
val predictions = Seq(Array(1.0,2.0), Array(2.0939961696605603E-15,0.9999999999999979), Array(Double.NaN,Double.NaN)).toDF("probability")
+--------------------------------------------+
|probability                                 |
+--------------------------------------------+
|[1.0, 2.0]                                  |
|[2.0939961696605603E-15, 0.9999999999999979]|
|[NaN, NaN]                                  |
+--------------------------------------------+

I applied the withColumn calls above, with this result:

+----------------------+------------------+
|prob1                 |prob2             |
+----------------------+------------------+
|1.0                   |2.0               |
|2.0939961696605603E-15|0.9999999999999979|
|NaN                   |NaN               |
+----------------------+------------------+

Schema mismatch edit

Since the Vector schema of your probability column doesn't match the arrayType schema assumed above, that solution won't work in your case. Please use the following solution instead. (Your probability column is an org.apache.spark.ml.linalg.Vector, while the udf in your question was typed against the old org.apache.spark.mllib.linalg.Vector; that is why Spark reports the confusing "requires vector type, however ... is of vector type" mismatch.)

You will have to create udf functions that return the expected values:

import org.apache.spark.ml.linalg.Vector // note: ml, not mllib
import org.apache.spark.sql.functions.udf

val first = udf((v: Vector) => v.toArray(0))
val second = udf((v: Vector) => v.toArray(1))
predictions
  .withColumn("prob1", first($"probability"))
  .withColumn("prob2", second($"probability"))
  .drop("probability")

I hope this gets you the desired result.