How to change the data types of all columns nested in StructType or ArrayType columns?

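I have a DataFrame that contains some columns of type StructType and ArrayType. I want to cast every IntegerType column to DoubleType. I found a few approaches to this problem, and one of them does something similar to what I want. The problem is that it does not change the data types of fields nested inside StructType and ArrayType columns.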

For example, I have a DataFrame with the following schema:

 |-- carCategories: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- payerId: integer (nullable = true)
 |-- percentage: integer (nullable = true)
 |-- plateNumberStatus: string (nullable = true)
 |-- ratio: struct (nullable = true)
 |    |-- max: integer (nullable = true)
 |    |-- min: integer (nullable = true)

After running the following script:

val doubleSchema = df.schema.fields.map{f =>
  f match{
    case StructField(name:String, _:IntegerType, _, _) => col(name).cast(DoubleType)
    case _ => col(f.name)
  }
}

df.select(doubleSchema:_*).printSchema

the result looks like this:

 |-- carCategories: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- payerId: double (nullable = true)
 |-- percentage: double (nullable = true)
 |-- plateNumberStatus: string (nullable = true)
 |-- ratio: struct (nullable = true)
 |    |-- max: integer (nullable = true)
 |    |-- min: integer (nullable = true)

As you can see, the top-level integer columns were converted to DoubleType, but the fields inside the ArrayType and StructType columns were not.

I would like the final schema to look like this:

 |-- carCategories: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- payerId: double (nullable = true)
 |-- percentage: double (nullable = true)
 |-- plateNumberStatus: string (nullable = true)
 |-- ratio: struct (nullable = true)
 |    |-- max: double (nullable = true)
 |    |-- min: double (nullable = true)

How can I achieve this?

Thanks in advance.

You can add case clauses that handle ArrayType and StructType, as follows:

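import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Builds one select expression per top-level field, casting integers to doubles.
def castIntToDouble(schema: StructType): Seq[Column] = {
  schema.fields.map { f =>
    f.dataType match {
      // Plain integer column: cast directly.
      case IntegerType => col(f.name).cast(DoubleType)
      // Struct column: rewrite its DDL type string (":int" -> ":double") and cast to it.
      case StructType(_) =>
        col(f.name).cast(
          f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
        )
      case dt: ArrayType =>
        dt.elementType match {
          // Array of integers: cast to an array of doubles.
          case IntegerType => col(f.name).cast(ArrayType(DoubleType))
          // Array of structs: same DDL string rewrite as for a struct column.
          case StructType(_) =>
            col(f.name).cast(
              f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
            )
          case _ => col(f.name)
        }
      // Any other type is selected unchanged.
      case _ => col(f.name)
    }
  }
}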

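When the column is a StructType or an array of nested structs, the function casts it using a type string in DDL format. For example, to cast the struct column ratio, which has type struct<max:int,min:int>, without having to rebuild the whole struct, you would do the following: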

df.withColumn("ratio", col("ratio").cast("struct<max:double,min:double>"))
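To make the string rewrite concrete, here is a minimal sketch of what castIntToDouble builds for the ratio column. It only uses the Spark SQL type classes; the names ratioType, before and after are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.types._

// The struct type of the `ratio` column from the question's schema.
val ratioType = StructType(Seq(
  StructField("max", IntegerType),
  StructField("min", IntegerType)
))

// simpleString renders the type in DDL form: "struct<max:int,min:int>"
val before = ratioType.simpleString

// Replacing ":int" with ":double" gives the target type string passed to cast.
val after = before.replace(
  s":${IntegerType.simpleString}",
  s":${DoubleType.simpleString}"
)
// after == "struct<max:double,min:double>"
```

Note that the replacement only matches field types written as `:int`, so an integer array nested inside a struct (rendered as `array<int>`) would be left unchanged by this rewrite.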

Now, applying castIntToDouble to your example input:

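// toDF on a Seq needs the session implicits; `spark` is the active SparkSession (as in spark-shell).
import spark.implicits._

val df = (
  Seq((Seq(1, 2, 3), 34, 87, "pending", (65, 22)))
    .toDF("carCategories", "payerId", "percentage", "plateNumberStatus", "ratio")
    // Rename the tuple fields _1/_2 to max/min so the schema matches the question.
    .withColumn("ratio", col("ratio").cast("struct<max:int,min:int>"))
)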

df.select(castIntToDouble(df.schema):_*).printSchema
//root
// |-- carCategories: array (nullable = true)
// |    |-- element: double (containsNull = true)
// |-- payerId: double (nullable = false)
// |-- percentage: double (nullable = false)
// |-- plateNumberStatus: string (nullable = true)
// |-- ratio: struct (nullable = true)
// |    |-- max: double (nullable = true)
// |    |-- min: double (nullable = true)
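If the nesting goes deeper than one level (for example a struct inside a struct, or an integer array inside a struct), one possible alternative to the string replacement is to rewrite the DataType recursively and cast each column to the rewritten type. The following is only a sketch under that assumption; intToDouble and castIntToDoubleDeep are hypothetical helper names, not part of the Spark API, and map columns are deliberately left untouched.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical helper: replace IntegerType with DoubleType at any depth,
// keeping field names and nullability as they are.
def intToDouble(dt: DataType): DataType = dt match {
  case IntegerType => DoubleType
  case ArrayType(elementType, containsNull) =>
    ArrayType(intToDouble(elementType), containsNull)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = intToDouble(f.dataType))))
  case other => other // strings, maps, etc. are returned unchanged
}

// Hypothetical helper: cast every top-level column to its rewritten type.
def castIntToDoubleDeep(df: DataFrame): DataFrame =
  df.select(df.schema.fields.map(f => col(f.name).cast(intToDouble(f.dataType))): _*)
```

Usage would be castIntToDoubleDeep(df).printSchema. Because the target type is built from the schema itself rather than from its string form, it does not depend on how field types are rendered in simpleString.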