Forming multiple columns from data in a particular column based on a condition in PySpark

Question

I have a Spark DataFrame like this:

item_id   attribute_id  attribute_value
1001      color          blue
1001      shape          rectangular
1001      material       copper
1002      color          black
1002      material       copper
1003      color          grey

I want the resulting DataFrame to look like this:

item_id   color    shape        material
1001      blue     rectangular  copper
1002      black    null         copper
1003      grey     null         null

I am trying to do this in PySpark but am not sure of the syntax. Any hints? Note: pointers in either PySpark or Spark SQL are appreciated.

Answer 1

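The operation you are looking for is pivot. Group the DataFrame by item_id, pivot the groups on attribute_id, and then take the first value of attribute_value for each cell. Items that lack an attribute get null in that column.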

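from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

data = [(1001, "color", "blue"),
        (1001, "shape", "rectangular"),
        (1001, "material", "copper"),
        (1002, "color", "black"),
        (1002, "material", "copper"),
        (1003, "color", "grey")]

df = spark.createDataFrame(data, ("item_id", "attribute_id", "attribute_value"))

# One row per item_id, one column per distinct attribute_id.
df.groupBy("item_id").pivot("attribute_id").agg(F.first("attribute_value")).show()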

"""
+-------+-----+--------+-----------+
|item_id|color|material|      shape|
+-------+-----+--------+-----------+
|   1001| blue|  copper|rectangular|
|   1002|black|  copper|       null|
|   1003| grey|    null|       null|
+-------+-----+--------+-----------+
"""
