Spark DF column to string JSON
I have a DF like this:
+------------+-------------------------------------------------------------+
|pk_attr_name|pk_struct |
+------------+-------------------------------------------------------------+
|CLNT_GRP_CD |{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"} |
|IDI_CONTRACT|{"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}|
+------------+-------------------------------------------------------------+
I want to build a single JSON string from the pk_struct column. Desired output:
pk_struct_str = '[{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"},{"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}]'
I tried:
pk_df.select(F.to_json(F.struct("pk_struct")).alias("json")).show(truncate=False)
but it does not give me the result I want.
pk_df.printSchema()
root
|-- pk_attr_name: string (nullable = true)
|-- pk_struct: string (nullable = true)
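For context on why the to_json attempt falls short: pk_struct is already a JSON string, so wrapping it in a struct and serializing again re-escapes the inner JSON instead of combining the rows. A plain-Python sketch of the effect, with json.dumps standing in for F.to_json:

```python
import json

# pk_struct is already JSON text, not a struct.
row_value = '{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"}'

# to_json(struct("pk_struct")) roughly corresponds to serializing
# {"pk_struct": row_value}: the inner JSON is re-escaped as a quoted
# string, and each row still produces its own separate object.
wrapped = json.dumps({"pk_struct": row_value})
print(wrapped)  # inner quotes come out escaped; not the desired array
```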
You can achieve this result with the collect_list or collect_set function, but these are aggregate functions, so they must be used inside an aggregation. So create a dummy column, group by that column's value, and use collect_list in the aggregation:
from pyspark.sql.functions import lit, collect_list

df.show(2, False)
# Constant dummy column so every row lands in the same group.
df1 = df.withColumn("dummy", lit("XXX"))
# collect_list gathers all pk_struct values into one array.
df2 = df1.groupBy("dummy").agg(collect_list(df1.pk_struct))
df2.show(2, False)
+------------+-------------------------------------------------------------+
|pk_attr_name|pk_struct |
+------------+-------------------------------------------------------------+
|CLNT_GRP_CD |{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"} |
|IDI_CONTRACT|{"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}|
+------------+-------------------------------------------------------------+
+-----+-----------------------------------------------------------------------------------------------------------------------------+
|dummy|collect_list(pk_struct) |
+-----+-----------------------------------------------------------------------------------------------------------------------------+
|XXX |[{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"}, {"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}]|
+-----+-----------------------------------------------------------------------------------------------------------------------------+
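Note that collect_list yields an array column, while the question asks for one JSON string. Since each pk_struct value already holds JSON text, the final string can be assembled on the driver after a collect(). A minimal sketch in plain Python, using the two collected values as stand-ins for the result of df2.collect():

```python
import json

# Stand-ins for the values returned by collect_list(pk_struct);
# each element is already a serialized JSON object.
collected = [
    '{"pk_seq":1,"pk_attr_id":20209,"pk_attr_name":"CLNT_GRP_CD"}',
    '{"pk_seq":2,"pk_attr_id":45483,"pk_attr_name":"IDI_CONTRACT"}',
]

# Join with commas and wrap in brackets to form a JSON array string.
pk_struct_str = "[" + ",".join(collected) + "]"

# Sanity check: the result parses as a JSON array of two objects.
assert [d["pk_attr_name"] for d in json.loads(pk_struct_str)] == [
    "CLNT_GRP_CD",
    "IDI_CONTRACT",
]
```

On Spark 2.4+ the same join can instead be done cluster-side with F.array_join(F.collect_list("pk_struct"), ",") and the brackets added via F.concat, avoiding the driver-side string assembly.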