Apply UDF to multiple columns in Spark Dataframe
I have a dataframe that looks like this:
+---+----+------+-----+---+---+-----+---+---+--------------+
| id| age| rbc| bgr| dm|cad|appet| pe|ane|classification|
+---+----+------+-----+---+---+-----+---+---+--------------+
| 3|48.0|normal|117.0| no| no| poor|yes|yes| ckd|
....
....
....
I have written a UDF to convert the categorical values yes, no, poor, normal into binary 0s and 1s:
import org.apache.spark.sql.functions.udf

// Maps a categorical string to 1 or 0; any value not listed here throws a MatchError
def stringToBinary(stringValue: String): Int = {
  stringValue match {
    case "yes" => 1
    case "no" => 0
    case "present" => 1
    case "notpresent" => 0
    case "normal" => 1
    case "abnormal" => 0
  }
}
val stringToBinaryUDF = udf(stringToBinary _)
I apply it to the dataframe as follows:
val newCol = stringToBinaryUDF.apply(col("dm")) // creates the new column with the formatted value
val refined1 = noZeroDF.withColumn("dm", newCol) // adds the new column to the original dataframe
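How can I pass multiple columns to the UDF so that I don't have to repeat myself for the other categorical columns?
A UDF can take many parameters, i.e. many columns, but it should return one result, i.e. one column.
To do that, just add parameters to your stringToBinary function.
If you want it to take two columns, it would look like this: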
// secondValue is available to the function but is not used in this example
def stringToBinary(stringValue: String, secondValue: String): Int = {
  stringValue match {
    case "yes" => 1
    case "no" => 0
    case "present" => 1
    case "notpresent" => 0
    case "normal" => 1
    case "abnormal" => 0
  }
}
val stringToBinaryUDF = udf(stringToBinary _)
Hope this helps.
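To make the call site concrete, here is a minimal sketch of a two-argument UDF applied to two columns. The rule `bothYes`, the object name `TwoColumnUdfSketch`, and the output column `pe_and_ane` are hypothetical and not part of the question; the sketch only illustrates that several input columns go in and one output column comes out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object TwoColumnUdfSketch {
  // Hypothetical rule: 1 only when both inputs are "yes", otherwise 0
  def bothYes(first: String, second: String): Int =
    if (first == "yes" && second == "yes") 1 else 0

  // Wrap the two-argument function as a Spark UDF
  val bothYesUDF = udf(bothYes _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("two-column-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-in for the dataframe in the question
    val df = Seq(("yes", "yes"), ("yes", "no")).toDF("pe", "ane")

    // Two columns are passed in, one new column is produced
    df.withColumn("pe_and_ane", bothYesUDF(col("pe"), col("ane"))).show(false)

    spark.stop()
  }
}
```

Note that a multi-argument UDF combines several columns into one value. It does not by itself apply the same single-column conversion to each categorical column.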
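You should not opt for a udf function when built-in Spark functions can do the same job, because a udf function has to serialize and deserialize the column data.
Given the dataframe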
+---+----+------+-----+---+---+-----+---+---+--------------+
|id |age |rbc |bgr |dm |cad|appet|pe |ane|classification|
+---+----+------+-----+---+---+-----+---+---+--------------+
|3 |48.0|normal|117.0|no |no |poor |yes|yes|ckd |
+---+----+------+-----+---+---+-----+---+---+--------------+
you can achieve your requirement with the when function, as in
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// 1 for positive labels, 0 for negative labels; any other value is left unchanged
def applyFunction(column: Column): Column =
  when(column === "yes" || column === "present" || column === "normal", lit(1))
    .otherwise(when(column === "no" || column === "notpresent" || column === "abnormal", lit(0)).otherwise(column))

df.withColumn("dm", applyFunction(col("dm")))
  .withColumn("cad", applyFunction(col("cad")))
  .withColumn("rbc", applyFunction(col("rbc")))
  .withColumn("pe", applyFunction(col("pe")))
  .withColumn("ane", applyFunction(col("ane")))
  .show(false)
The result is
+---+----+---+-----+---+---+-----+---+---+--------------+
|id |age |rbc|bgr |dm |cad|appet|pe |ane|classification|
+---+----+---+-----+---+---+-----+---+---+--------------+
|3 |48.0|1 |117.0|0 |0 |poor |1 |1 |ckd |
+---+----+---+-----+---+---+-----+---+---+--------------+
Now, the question clearly states that you don't want to repeat the procedure for every column, so you can do the following:
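// Columns to convert: the same five columns as in the chained withColumn calls
val columnsTomap = Seq("dm", "cad", "rbc", "pe", "ane")
var tempdf = df
// foreach rather than map, since only the side effect on tempdf is needed
columnsTomap.foreach { column =>
  tempdf = tempdf.withColumn(column, applyFunction(col(column)))
}
tempdf.show(false)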
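Another way to avoid both the repetition and the mutable variable is to build every output column in a single select. The following is only a sketch under stated assumptions: `toBinary` is a hypothetical variant of the when-based function that yields null for unmatched values (so the converted columns stay integer typed), and `convert`, `SelectSketch`, and the sample row are illustrative names and data, not taken from the answer above.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object SelectSketch {
  val positive = Seq("yes", "present", "normal")
  val negative = Seq("no", "notpresent", "abnormal")

  // Hypothetical variant: 1 for positive labels, 0 for negative ones,
  // null for anything else (there is no otherwise branch)
  def toBinary(column: Column): Column =
    when(column.isin(positive: _*), 1).when(column.isin(negative: _*), 0)

  // Rebuild every column in one select: convert the targets, pass the rest through
  def convert(df: DataFrame, targets: Set[String]): DataFrame =
    df.select(df.columns.map { name =>
      if (targets.contains(name)) toBinary(col(name)).as(name) else col(name)
    }: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-in for the dataframe in the question
    val df = Seq(("normal", "poor", "no")).toDF("rbc", "appet", "dm")
    convert(df, Set("rbc", "dm")).show(false)

    spark.stop()
  }
}
```

The column order is preserved because the select iterates over `df.columns`, and columns outside `targets`, such as appet, pass through untouched.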
You can also use the foldLeft function, with your UDF named stringToBinaryUDF:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val categoricalColumns = Seq("dm", "cad", "rbc", "pe", "ane")
// Each step returns a new DataFrame with one more column converted
val refinedDF = categoricalColumns
  .foldLeft(noZeroDF) { (accumulatorDF: DataFrame, columnName: String) =>
    accumulatorDF
      .withColumn(columnName, stringToBinaryUDF(col(columnName)))
  }
This respects immutability and functional programming.
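One caveat with folding the UDF from the question over several columns: its match has no default case, so a value such as "poor" would fail at run time with a scala.MatchError. A possible sketch of a total version is shown here, assuming that unknown or null values should become null in the output. The names `stringToBinaryOpt`, `convertAll`, and `SafeFoldSketch` are hypothetical, as is the sample row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object SafeFoldSketch {
  // Total mapping: unknown or null input yields None, which Spark stores as null
  def stringToBinaryOpt(value: String): Option[Int] = value match {
    case "yes" | "present" | "normal"     => Some(1)
    case "no" | "notpresent" | "abnormal" => Some(0)
    case _                                => None
  }

  val stringToBinaryUDF = udf(stringToBinaryOpt _)

  // Fold the UDF over the given column names, one withColumn per name
  def convertAll(df: DataFrame, columns: Seq[String]): DataFrame =
    columns.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, stringToBinaryUDF(col(name)))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("safe-fold-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-in for noZeroDF; "poor" is deliberately outside the mapping
    val noZeroDF = Seq(("normal", "poor", "no")).toDF("rbc", "appet", "dm")
    convertAll(noZeroDF, Seq("rbc", "appet", "dm")).show(false)

    spark.stop()
  }
}
```

Returning `Option[Int]` keeps the function total without picking an arbitrary numeric default; whether null is the right outcome for values like "poor" depends on how the data is used downstream.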