How to create multiple key-value pairs from a DataFrame in PySpark
I have a DataFrame with the following values:
customer_hash count_beautyhygiene_l3_decile net_paid_amount_l12_decile unique_days_l12_decile
1234 1 3 1
5678 2 3 4
1257 3 2 2
I used the code below to build the key-value pairs for each customer_hash:
df = df.groupBy("customer_hash").agg(collect_list(struct("count_beautyhygiene_l3_decile","net_paid_amount_l12_decile","unique_days_l12_decile")).alias('brandVariable'))
That query produces the following result:
customer_hash brandVariable
1234 [{"count_beautyhygiene_l3_decile": 1,
"net_paid_amount_l12_decile": 3,
"unique_days_l12_decile": 1}]
5678 [{"count_beautyhygiene_l3_decile": 2,
"net_paid_amount_l12_decile": 3,
"unique_days_l12_decile": 4}]
1257 [{"count_beautyhygiene_l3_decile": 3,
"net_paid_amount_l12_decile": 2,
"unique_days_l12_decile": 2}]
But I need the output in the following format:
customer_hash brandVariable
1234 [{
"NAME": "count_beautyhygiene_l3_decile",
"VALUE": "1"
},
{
"NAME": "net_paid_amount_l12_decile",
"VALUE": "3"
},
{
"NAME": "unique_days_l12_decile",
"VALUE": "1"
}]
5678 [{
"NAME": "count_beautyhygiene_l3_decile",
"VALUE": "2"
},
{
"NAME": "net_paid_amount_l12_decile",
"VALUE": "3"
},
{
"NAME": "unique_days_l12_decile",
"VALUE": "4"
}] ... and so on
How can I achieve the required output?
Try the following.
Input data:
data=[(1234,1,3,1),(5678,2,3,4),(1257,3,2,2)]
schema = ["customer_hash","count_beautyhygiene_l3_decile","net_paid_amount_l12_decile","unique_days_l12_decile"]
df = spark.createDataFrame(data=data,schema=schema)
df.show()
+-------------+-----------------------------+--------------------------+----------------------+
|customer_hash|count_beautyhygiene_l3_decile|net_paid_amount_l12_decile|unique_days_l12_decile|
+-------------+-----------------------------+--------------------------+----------------------+
|         1234|                            1|                         3|                     1|
|         5678|                            2|                         3|                     4|
|         1257|                            3|                         2|                     2|
+-------------+-----------------------------+--------------------------+----------------------+
Required output:
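from pyspark.sql.functions import collect_list, explode, from_json, struct, to_json
from pyspark.sql.types import IntegerType, MapType, StringType
# Serialize the three decile columns to JSON, then parse that JSON back as a map,
# so that each original column becomes one map entry (column name -> value).
(df.select("customer_hash", to_json(struct("count_beautyhygiene_l3_decile", "net_paid_amount_l12_decile", "unique_days_l12_decile")).alias("temp"))
    .select("customer_hash", from_json("temp", MapType(StringType(), IntegerType())).alias("entries"))
    # explode() on a map column yields one row per entry, with the key and the value as two columns.
    .select("customer_hash", explode("entries").alias("NAME", "VALUES"))
    # Re-serialize each pair as a JSON string and collect the strings per customer.
    # Note: this emits integer values under the key "VALUES"; for string values under "VALUE",
    # use MapType(StringType(), StringType()) and .alias("NAME", "VALUE") instead.
    .select("customer_hash", to_json(struct("NAME", "VALUES")).alias("temp2"))
    .groupBy("customer_hash").agg(collect_list("temp2").alias("brandVariable"))
).show(truncate=False)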
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_hash|brandVariable |
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|1234 |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":1}, {"NAME":"net_paid_amount_l12_decile","VALUES":3}, {"NAME":"unique_days_l12_decile","VALUES":1}]|
|5678 |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":2}, {"NAME":"net_paid_amount_l12_decile","VALUES":3}, {"NAME":"unique_days_l12_decile","VALUES":4}]|
|1257 |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":3}, {"NAME":"net_paid_amount_l12_decile","VALUES":2}, {"NAME":"unique_days_l12_decile","VALUES":2}]|
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
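For checking the Spark result against the format the question asks for, it can help to spell out the same reshaping in plain Python on a small sample. The following is only a minimal sketch and not Spark code: the names to_name_value_pairs, rows and brand_variable are hypothetical, and it assumes each record is a dict whose non-key columns should each become one {"NAME": ..., "VALUE": ...} entry, with the value rendered as a string as in the question.

```python
import json


def to_name_value_pairs(row, key="customer_hash"):
    """Turn one wide record into (key value, list of NAME/VALUE dicts).

    Hypothetical helper for illustration: every column except `key`
    becomes one entry, and the value is converted to a string.
    """
    pairs = [
        {"NAME": col, "VALUE": str(val)}
        for col, val in row.items()
        if col != key
    ]
    return row[key], pairs


# Sample records copied from the question's input data.
rows = [
    {"customer_hash": 1234, "count_beautyhygiene_l3_decile": 1,
     "net_paid_amount_l12_decile": 3, "unique_days_l12_decile": 1},
    {"customer_hash": 5678, "count_beautyhygiene_l3_decile": 2,
     "net_paid_amount_l12_decile": 3, "unique_days_l12_decile": 4},
    {"customer_hash": 1257, "count_beautyhygiene_l3_decile": 3,
     "net_paid_amount_l12_decile": 2, "unique_days_l12_decile": 2},
]

# Map each customer_hash to its list of NAME/VALUE entries.
brand_variable = dict(to_name_value_pairs(r) for r in rows)

# Pretty-print one customer's entries for visual comparison.
print(json.dumps(brand_variable[1234], indent=2))
```

Comparing a few customers from this dictionary with the brandVariable column collected from Spark is a quick way to confirm the key names and value types before running the job on the full data.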