How to assign constant values to the nested objects in pyspark?

Question

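I have a requirement to mask the data in some fields of a given schema. I have researched a lot but could not find the answer I need. This is the schema in which I need to change some fields (answer_type, response0, response3):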

 |    |-- choices: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- choice_id: long (nullable = true)
 |    |    |    |-- created_time: long (nullable = true)
 |    |    |    |-- updated_time: long (nullable = true)
 |    |    |    |-- created_by: long (nullable = true)
 |    |    |    |-- updated_by: long (nullable = true)
 |    |    |    |-- answers: struct (nullable = true)
 |    |    |    |    |-- answer_node_internal_id: long (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |    |    |    |    |-- text: map (nullable = true)
 |    |    |    |    |    |-- key: string
 |    |    |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |    |    |-- data_tag: string (nullable = true)
 |    |    |    |    |-- answer_type: string (nullable = true)
 |    |    |    |-- response: struct (nullable = true)
 |    |    |    |    |-- response0: string (nullable = true)
 |    |    |    |    |-- response1: long (nullable = true)
 |    |    |    |    |-- response2: double (nullable = true)
 |    |    |    |    |-- response3: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)

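Is there a way to assign values to these fields in pyspark without affecting the structure above?

I have tried using explode, but I could not get back to the original schema. I also do not want to create a new column, and I do not want to lose any data from the given schema object.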

Answer 1

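I ran into a similar problem a few days ago. I suggest converting the struct type to JSON, making the internal changes with a UDF, and then converting the result back to the original struct.

See to_json and from_json in the documentation.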

https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json

https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
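
A minimal sketch of that round trip, under a few assumptions that are not in the question: the DataFrame exposes `choices` as a column (the schema excerpt shows it nested under a parent that is not shown, so the same idea may need to be applied to that parent column instead), the mask is the constant "****", and fields that are null or absent are left untouched. The names MASK, mask_choices_json and mask_choices are hypothetical helpers for illustration only.

```python
import json

MASK = "****"  # hypothetical constant; use whatever mask value you need


def mask_choices_json(choices_json):
    """Mask answer_type, response0 and response3 in a JSON-encoded `choices` array."""
    if choices_json is None:
        return None
    choices = json.loads(choices_json)
    for choice in choices:
        if choice is None:
            continue
        # to_json may omit null fields, so use .get() instead of indexing.
        answers = choice.get("answers")
        if answers is not None and answers.get("answer_type") is not None:
            answers["answer_type"] = MASK
        response = choice.get("response")
        if response is not None:
            if response.get("response0") is not None:
                response["response0"] = MASK
            if response.get("response3") is not None:
                # Keep the array length, replace every element with the mask.
                response["response3"] = [MASK for _ in response["response3"]]
    return json.dumps(choices)


def mask_choices(df, column="choices"):
    """Overwrite `column` in place: struct/array -> JSON -> UDF -> original type."""
    # Imported here so the pure-Python helper above can be tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    original_type = df.schema[column].dataType  # reused to parse the JSON back
    mask_udf = F.udf(mask_choices_json, StringType())
    masked = F.from_json(mask_udf(F.to_json(F.col(column))), original_type)
    # Same column name, so no new column is created.
    return df.withColumn(column, masked)
```

Calling `mask_choices(df).printSchema()` should then show the same field names and types as before, since the original data type is passed back to from_json; nullability flags may come back as nullable, which matches the schema shown in the question. A Python UDF adds serialization overhead, so this is a convenience trade-off rather than the fastest option.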

apache-spark-sql

pyspark

spark-avro