How to create multiple key-value pairs from a dataframe in PySpark

I have a dataframe with the following values:

customer_hash  count_beautyhygiene_l3_decile  net_paid_amount_l12_decile  unique_days_l12_decile
1234           1                              3                           1
5678           2                              3                           4
1257           3                              2                           2

I am using the below code to get the key-value pairs for each customer_hash:

from pyspark.sql.functions import collect_list, struct

df = df.groupBy("customer_hash").agg(collect_list(struct("count_beautyhygiene_l3_decile", "net_paid_amount_l12_decile", "unique_days_l12_decile")).alias('brandVariable'))

The above query gives the following result:

customer_hash      brandVariable
1234              [{"count_beautyhygiene_l3_decile": 1, 
                   "net_paid_amount_l12_decile": 3, 
                  "unique_days_l12_decile": 1}]
5678              [{"count_beautyhygiene_l3_decile": 2, 
                   "net_paid_amount_l12_decile": 3, 
                   "unique_days_l12_decile": 4}]      
1257              [{"count_beautyhygiene_l3_decile": 3, 
                   "net_paid_amount_l12_decile": 2, 
                   "unique_days_l12_decile": 2}]

But my requirement is to generate the output in the below format:

customer_hash      brandVariable
1234              [{
                   "NAME": "count_beautyhygiene_l3_decile",      
                   "VALUE": "1"
                  },
                  {
                  "NAME": "net_paid_amount_l12_decile",           
                  "VALUE": "3"
                  },
                  {
                  "NAME": "unique_days_l12_decile",
                  "VALUE": "1"
                  }]


5678              [{
                   "NAME": "count_beautyhygiene_l3_decile",
                   "VALUE": "2"
                  },
                  {
                  "NAME": "net_paid_amount_l12_decile", 
                  "VALUE": "3"
                  },
                  {
                  "NAME": "unique_days_l12_decile",
                  "VALUE": "4"
                  }]
... and so on for the remaining customers.

How can I achieve the required output?

Try the below approach -

Input data:

data=[(1234,1,3,1),(5678,2,3,4),(1257,3,2,2)]
schema = ["customer_hash","count_beautyhygiene_l3_decile","net_paid_amount_l12_decile","unique_days_l12_decile"]

df = spark.createDataFrame(data=data,schema=schema)
df.show()

+-------------+-----------------------------+--------------------------+----------------------+
|customer_hash|count_beautyhygiene_l3_decile|net_paid_amount_l12_decile|unique_days_l12_decile|
+-------------+-----------------------------+--------------------------+----------------------+
|         1234|                            1|                         3|                     1|
|         5678|                            2|                         3|                     4|
|         1257|                            3|                         2|                     2|
+-------------+-----------------------------+--------------------------+----------------------+

To get the required output:

from pyspark.sql.functions import *
from pyspark.sql.types import *

(df.select("customer_hash",
           # Serialize the three decile columns of each row into one JSON string.
           to_json(struct("count_beautyhygiene_l3_decile", "net_paid_amount_l12_decile", "unique_days_l12_decile")).alias("temp"))
   # Parse the JSON string back into a map of column name -> value.
   .select("customer_hash", from_json("temp", MapType(StringType(), IntegerType())).alias("entries"))
   # Explode the map into one (NAME, VALUES) row per column.
   .select("customer_hash", explode("entries").alias("NAME", "VALUES"))
   # Turn each NAME/VALUES pair into a JSON string.
   .select("customer_hash", to_json(struct("NAME", "VALUES")).alias("temp2"))
   # Collect all pairs back into a single list per customer_hash.
   .groupBy("customer_hash").agg(collect_list("temp2").alias("brandVariable"))
 ).show(truncate=False)

+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_hash|brandVariable                                                                                                                                        |
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|1234         |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":1}, {"NAME":"net_paid_amount_l12_decile","VALUES":3}, {"NAME":"unique_days_l12_decile","VALUES":1}]|
|5678         |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":2}, {"NAME":"net_paid_amount_l12_decile","VALUES":3}, {"NAME":"unique_days_l12_decile","VALUES":4}]|
|1257         |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":3}, {"NAME":"net_paid_amount_l12_decile","VALUES":2}, {"NAME":"unique_days_l12_decile","VALUES":2}]|
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
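
Note that the JSON keys above are "NAME"/"VALUES" with integer values, while the requested format uses "NAME"/"VALUE" with string values. If the output must match that format exactly, a minimal variant (a sketch, assuming the same three decile columns from the sample data) builds the structs directly and skips the to_json/from_json round trip:

from pyspark.sql.functions import array, col, lit, struct, to_json

# Decile columns taken from the sample dataframe above.
decile_cols = ["count_beautyhygiene_l3_decile",
               "net_paid_amount_l12_decile",
               "unique_days_l12_decile"]

# Build one JSON string per column: {"NAME": <column name>, "VALUE": "<value as string>"}.
result = df.select(
    "customer_hash",
    array(*[
        to_json(struct(lit(c).alias("NAME"), col(c).cast("string").alias("VALUE")))
        for c in decile_cols
    ]).alias("brandVariable"),
)
result.show(truncate=False)

Since each row already contains all of its columns, no groupBy/collect_list is needed in this variant.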