Pyspark dataframe column contains array of dictionaries, want to make each key from dictionary into a column
I currently have a dataframe that looks like this:
+----+-------------------------+
| Id | value_list_of_dicts     |
+----+-------------------------+
| 1  | [{"val1":0, "val2":0},  |
|    |  {"val1":2, "val2":5}]  |
+----+-------------------------+
| 2  | [{"val1":9, "val2":10}, |
|    |  {"val1":1, "val2":2}]  |
+----+-------------------------+
Each list contains exactly 30 dictionaries; the values can differ, but the key names are always the same. I want my dataframe to look like this:
+----+------+------+
| Id | val1 | val2 |
+----+------+------+
| 1  | 0    | 0    |
+----+------+------+
| 1  | 2    | 5    |
+----+------+------+
| 2  | 9    | 10   |
+----+------+------+
| 2  | 1    | 2    |
+----+------+------+
What is the best way to do this?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, [{"val1": 0, "val2": 0}, {"val1": 2, "val2": 5}]),
        (2, [{"val1": 9, "val2": 10}, {"val1": 1, "val2": 2}]),
    ],
    ("ID", "List"),
)

# explode() turns each element of the array into its own row;
# the dictionaries arrive as a MapType column, so getItem()
# extracts the value stored under each key.
df2 = df.select(df.ID, explode(df.List).alias("Column1"))
df2.withColumn("Val1", F.col("Column1").getItem("val1")) \
   .withColumn("Val2", F.col("Column1").getItem("val2")) \
   .show(truncate=False)
Output:
+---+-----------------------+----+----+
|ID |Column1                |Val1|Val2|
+---+-----------------------+----+----+
|1  |[val2 -> 0, val1 -> 0] |0   |0   |
|1  |[val2 -> 5, val1 -> 2] |2   |5   |
|2  |[val2 -> 10, val1 -> 9]|9   |10  |
|2  |[val2 -> 2, val1 -> 1] |1   |2   |
+---+-----------------------+----+----+
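
To end up with exactly the shape asked for (Id plus one column per key, without the intermediate map column), you can generate the key columns with a list comprehension instead of chaining withColumn calls. A minimal sketch building on the df above; the names keys, entry, and result are illustrative, and the keys list can be extended to cover all the key names in your real data:

# One output column per key, generated from a list of key names.
# Assumes every dictionary in the array uses the same keys,
# as stated in the question.
keys = ["val1", "val2"]
result = (
    df.select(df.ID, explode(df.List).alias("entry"))
      .select("ID", *[F.col("entry").getItem(k).alias(k) for k in keys])
)
result.show()

This yields the ID, val1, val2 layout from the question and scales to any number of keys, since the columns are built from the keys list rather than written out by hand.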