Change the schema of a DataFrame to another schema

I have a DataFrame that looks like this:

df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- foo01: string (nullable = true)
 |    |-- bar01: string (nullable = true)
 |    |-- foo02: string (nullable = true)
 |    |-- bar02: string (nullable = true)

I would like to convert it to:

root
 |-- id: integer (nullable = true)
 |-- foo: struct (nullable = true)
 |    |-- foo01: string (nullable = true)
 |    |-- foo02: string (nullable = true)
 |-- bar: struct (nullable = true)
 |    |-- bar01: string (nullable = true)
 |    |-- bar02: string (nullable = true)

What is the best way to do this?

You can simply use the PySpark struct function.

from pyspark.sql.functions import struct

new_df = df.select(
  'id',
  struct('data.foo01', 'data.foo02').alias('foo'),
  struct('data.bar01', 'data.bar02').alias('bar'),
)

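An additional note on the PySpark struct function: it accepts either column names as strings, which simply moves those columns into the struct, or Column expressions if you need to rename or transform the fields.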
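To illustrate the two argument forms mentioned above, here is a minimal sketch. The function name `build_structs` and the field aliases `first` and `second_upper` are hypothetical, chosen only for this example; `df` is assumed to have the schema from the question.

```python
def build_structs(df):
    """Split the `data` struct of `df` into `foo` and `bar` structs."""
    # Imported here so this module loads even where Spark is not installed.
    from pyspark.sql import functions as F

    return df.select(
        "id",
        # Form 1: plain string column names; fields keep their original names.
        F.struct("data.foo01", "data.foo02").alias("foo"),
        # Form 2: Column expressions; fields can be renamed or transformed.
        # The aliases below are hypothetical names for illustration only.
        F.struct(
            F.col("data.bar01").alias("first"),
            F.upper(F.col("data.bar02")).alias("second_upper"),
        ).alias("bar"),
    )
```

Calling `build_structs(df).printSchema()` should show `bar` with the renamed fields, while `foo` keeps `foo01` and `foo02`.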

You can use the struct function with select, as follows:

from pyspark.sql import functions as F

finalDF = df.select( "id",
                     F.struct("data.foo01", "data.foo02").alias("foo"),
                     F.struct("data.bar01", "data.bar02").alias("bar")
                     )


finalDF.printSchema()

Schema:

root
 |-- id: integer (nullable = true)
 |-- foo: struct (nullable = false)
 |    |-- foo01: string (nullable = true)
 |    |-- foo02: string (nullable = true)
 |-- bar: struct (nullable = false)
 |    |-- bar01: string (nullable = true)
 |    |-- bar02: string (nullable = true)
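
If `data` has many fields, listing them by hand gets tedious. The following is a rough sketch of one way to build the same structs dynamically, assuming every field name is a prefix followed by trailing digits (as in `foo01`, `bar02`). The helper names `group_by_prefix` and `regroup` are hypothetical and not part of PySpark.

```python
import re
from collections import defaultdict


def group_by_prefix(field_names):
    """Group names by their prefix with trailing digits removed."""
    groups = defaultdict(list)
    for name in field_names:
        # "foo01" -> "foo"; assumes the naming pattern from the question.
        prefix = re.sub(r"\d+$", "", name)
        groups[prefix].append(name)
    return dict(groups)


def regroup(df, parent="data"):
    """Replace the `parent` struct with one struct per field-name prefix."""
    # Imported here so the pure-Python helper works without Spark installed.
    from pyspark.sql import functions as F

    # Read the nested field names from the DataFrame's schema.
    field_names = df.schema[parent].dataType.fieldNames()
    groups = group_by_prefix(field_names)
    other_cols = [c for c in df.columns if c != parent]
    return df.select(
        *other_cols,
        *[
            F.struct(*[f"{parent}.{n}" for n in names]).alias(prefix)
            for prefix, names in groups.items()
        ],
    )
```

For the DataFrame in the question, `regroup(df)` should produce the same `foo` and `bar` structs as the explicit select above.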