How to modify pyspark dataframe nested struct column
I am trying to anonymize/hash a nested column, but haven't been successful. The schema looks like this:
-- abc: struct (nullable = true)
| |-- xyz: struct (nullable = true)
| | |-- abc123: string (nullable = true)
| | |-- services: struct (nullable = true)
| | | |-- service: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- subtype: string (nullable = true)
I need to anonymize/hash the value of the type column.
For Spark 3.1+, there is a Column method withField that can be used to update struct fields.
Suppose this is your input DataFrame (corresponding to the schema you provided):
from pyspark.sql import Row
df = spark.createDataFrame([
Row(abc=Row(xyz=Row(abc123="value123", services=[Row(type="type1", subtype="subtype1")])))
])
df.show(truncate=False)
#+---------------------------------+
#|abc |
#+---------------------------------+
#|{{value123, [{type1, subtype1}]}}|
#+---------------------------------+
You can use transform on the array services to hash the field type of each struct element (here using the xxhash64 function), for example:
import pyspark.sql.functions as F
df2 = df.withColumn(
"abc",
F.col("abc").withField(
"xyz",
F.col("abc.xyz").withField(
"services",
F.expr("transform(abc.xyz.services, x -> struct(xxhash64(x.type) as type, x.subtype))")
)
)
)
df2.show(truncate=False)
#+-----------------------------------------------+
#|abc |
#+-----------------------------------------------+
#|{{value123, [{2134479862461603894, subtype1}]}}|
#+-----------------------------------------------+
For older Spark versions, you need to recreate the entire struct to update a field, which gets tedious when there are many nested fields. In your case it would look like this:
df2 = df.withColumn(
"abc",
F.struct(
F.struct(
F.col("abc.xyz.abc123"),
F.expr(
"transform(abc.xyz.services, x -> struct(xxhash64(x.type) as type, x.subtype))"
).alias("services")
).alias("xyz")
)
)