Spark DataFrame UDF Partitioning Columns
I want to transform a column. The new column should contain only one partition of the original column. I defined the following UDF:
def extract(index: Integer) = udf((v: Seq[Double]) => v.grouped(16).toSeq(index))
and later use it in a loop:
myDF = myDF.withColumn("measurement_" + i, extract(i)($"vector"))
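For context, the surrounding loop might look like this (a minimal sketch; numChunks is a hypothetical name, and the vector length of 64 is an assumption, not stated in the question):
// Hypothetical: 64-element vectors split into 16-element groups give 4 chunks.
val numChunks = 64 / 16
for (i <- 0 until numChunks) {
  myDF = myDF.withColumn("measurement_" + i, extract(i)($"vector"))
}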
The original vector column was created as follows:
var vectors: Seq[Seq[Double]] = myVectors
vectors.toDF("vector")
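For toDF to be available on a local Seq, the session implicits must be in scope. A self-contained sketch (the SparkSession setup and the sample data are assumptions; the question does not show what myVectors contains):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("extract-example").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data standing in for myVectors, which the question elides.
val vectors: Seq[Seq[Double]] = Seq(Seq(1.0, 2.0, 3.0, 4.0), Seq(4.0, 3.0, 2.0, 1.0))
var myDF = vectors.toDF("vector")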
But eventually I get the following error:
Failed to execute user defined function(anonfun$user$sparkapp$MyClass$$extract: (array<double>) => array<double>)
Did I define the UDF incorrectly?
I can reproduce the error when I try to extract an element that does not exist, i.e. by passing an index beyond the length of the grouped sequence:
val myDF = Seq(Seq(1.0, 2.0, 3.0, 4.0), Seq(4.0, 3.0, 2.0, 1.0)).toDF("vector")
myDF: org.apache.spark.sql.DataFrame = [vector: array<double>]
def extract(index: Integer) = udf((v: Seq[Double]) => v.grouped(2).toSeq(index))
// extract: (index: Integer)org.apache.spark.sql.expressions.UserDefinedFunction
val i = 2
myDF.withColumn("measurement_" + i, extract(i)($"vector")).show
which produces this error:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$extract: (array<double>) => array<double>)
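The failure can be seen without Spark at all: indexing past the end of the grouped sequence throws, and that exception is what surfaces wrapped in the SparkException above. In plain Scala (a minimal sketch):
val groups = Seq(1.0, 2.0, 3.0, 4.0).grouped(2).toSeq
groups(1) // List(3.0, 4.0)
groups(2) // throws java.lang.IndexOutOfBoundsException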
Most likely you are hitting exactly this when calling toSeq(index). Try toSeq.lift(index) instead, which returns None if the index is out of range:
def extract(index: Integer) = udf((v: Seq[Double]) => v.grouped(2).toSeq.lift(index))
extract: (index: Integer)org.apache.spark.sql.expressions.UserDefinedFunction
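lift turns indexing into a total function that returns an Option, and Spark renders a None returned from a UDF as SQL null. In plain Scala (a minimal sketch):
val groups = Seq(1.0, 2.0, 3.0, 4.0).grouped(2).toSeq
groups.lift(1) // Some(List(3.0, 4.0))
groups.lift(2) // None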
With a valid index:
val i = 1
myDF.withColumn("measurement_" + i, extract(i)($"vector")).show
+--------------------+-------------+
| vector|measurement_1|
+--------------------+-------------+
|[1.0, 2.0, 3.0, 4.0]| [3.0, 4.0]|
|[4.0, 3.0, 2.0, 1.0]| [2.0, 1.0]|
+--------------------+-------------+
With an out-of-range index:
val i = 2
myDF.withColumn("measurement_" + i, extract(i)($"vector")).show
+--------------------+-------------+
| vector|measurement_2|
+--------------------+-------------+
|[1.0, 2.0, 3.0, 4.0]| null|
|[4.0, 3.0, 2.0, 1.0]| null|
+--------------------+-------------+
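If the null columns are not wanted, one option (a sketch, not part of the original answer; chunkSize and numChunks are hypothetical names) is to derive the loop bound from the actual number of groups instead of guessing indices:
val chunkSize = 2
// Assumes all vectors share the same length; here length 4 gives 2 groups.
val numChunks = 4 / chunkSize
var df = myDF
for (i <- 0 until numChunks) {
  df = df.withColumn("measurement_" + i, extract(i)($"vector"))
}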