Using a case class to rename split columns with Spark Dataframe
I split 'split_column' into five additional columns with the following code. However, I want to rename the new columns so that they have meaningful names (say "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5" in this example):
val df1 = df.withColumn("new_column", split(col("split_column"), "\\|"))
  .select(col("*") +: (0 until 5).map(i => col("new_column").getItem(i).as(s"newcol$i")): _*)
  .drop("split_column", "new_column")
val new_columns_renamed = Seq("....", "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5")
val df2 = df1.toDF(new_columns_renamed: _*)
The problem with this approach, however, is that some of my splits can produce more than fifty new columns. With this renaming method, a small typo (an extra comma, a missing double quote) is very hard to detect.
Is there a way to rename the columns using a case class, like the one below?
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
Note that in the real scenario the names will not be like new_renamed1, new_renamed2, ....., new_renamed5; they will be completely different.
You can try something like this:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._  // assuming a SparkSession named spark, for the $"..." syntax

// Pull the field names straight from the case class, in declaration order
val names = Encoders.product[SplittedRecord].schema.fieldNames

names.zipWithIndex
  .foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|"))) {
    case (df, (c, i)) => df.withColumn(c, $"new_column"(i))
  }
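As a quick plain-Scala illustration (no Spark session needed), the same foldLeft pattern can be sketched with a Map standing in for the DataFrame; each step "adds a column" for one (name, index) pair, and the pipe must be regex-escaped in split just as in the Spark code:

```scala
// Hedged sketch: a Map plays the role of the DataFrame, and each foldLeft
// step adds one "column" (name -> value) per (fieldName, index) pair.
val fieldNames = Seq("new_renamed1", "new_renamed2", "new_renamed3")
val parts = "a|b|c".split("\\|") // split takes a regex, so the pipe needs escaping

val renamed = fieldNames.zipWithIndex.foldLeft(Map.empty[String, String]) {
  case (acc, (name, i)) => acc + (name -> parts(i))
}

println(renamed("new_renamed2")) // prints "b"
```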
One way to use the case class
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
is through a udf function:
import org.apache.spark.sql.functions._
def splitUdf = udf((array: Seq[String]) => SplittedRecord(array(0), array(1), array(2), array(3), array(4)))

df.withColumn("test", splitUdf(split(col("split_column"), "\\|")))
  .drop("split_column")
  .select(col("*"), col("test.*"))
  .drop("test")
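One caveat with this udf: it indexes array(0) through array(4) directly, so a row whose split_column yields fewer than five parts would fail inside the udf. A hedged plain-Scala sketch of one way around that, padding the sequence before constructing the record (the padTo call and null default are my additions, not part of the original answer):

```scala
case class SplittedRecord(new_renamed1: String, new_renamed2: String,
                          new_renamed3: String, new_renamed4: String,
                          new_renamed5: String)

// Pad short splits with null so the constructor call always has five arguments.
def toRecord(parts: Seq[String]): SplittedRecord = {
  val p = parts.padTo(5, null: String)
  SplittedRecord(p(0), p(1), p(2), p(3), p(4))
}

println(toRecord("a|b".split("\\|").toSeq)) // missing fields come back as null
```

Wrapping toRecord in udf instead of the inline lambda would let rows with missing trailing fields through instead of failing the task.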