Spark Pivot String in PySpark

I'm having trouble restructuring data with Spark. The original data looks like this:

df = sqlContext.createDataFrame([
    ("ID_1", "VAR_1", "Butter"),
    ("ID_1", "VAR_2", "Toast"),
    ("ID_1", "VAR_3", "Ham"),
    ("ID_2", "VAR_1", "Jam"),
    ("ID_2", "VAR_2", "Toast"),
    ("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])

>>> df.show()
+----+-----+------+
|  ID|  VAR|   VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3|   Ham|
|ID_2|VAR_1|   Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3|   Egg|
+----+-----+------+

This is the structure I'm trying to achieve:

+----+------+-----+-----+
|  ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast|  Ham|
|ID_2|   Jam|Toast|  Egg|
+----+------+-----+-----+

My idea was to use:

df.groupBy("ID").pivot("VAR").show()

But I get the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'

Any suggestions? Thanks!

You need to add an aggregation after pivot(). If you are sure there is only one "VAL" for each ("ID", "VAR") pair, you can use first():

from pyspark.sql import functions as f

result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL"))
result.show()

+----+------+-----+-----+
|  ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast|  Ham|
|ID_2|   Jam|Toast|  Egg|
+----+------+-----+-----+