Spark Pivot String in PySpark
I'm having trouble reshaping data with Spark. The original data looks like this:
df = sqlContext.createDataFrame([
("ID_1", "VAR_1", "Butter"),
("ID_1", "VAR_2", "Toast"),
("ID_1", "VAR_3", "Ham"),
("ID_2", "VAR_1", "Jam"),
("ID_2", "VAR_2", "Toast"),
("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])
>>> df.show()
+----+-----+------+
| ID| VAR| VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3| Ham|
|ID_2|VAR_1| Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3| Egg|
+----+-----+------+
This is the structure I'm trying to achieve:
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
My idea was to use:
df.groupBy("ID").pivot("VAR").show()
But I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'
Any suggestions? Thanks!
You need to add an aggregation after pivot(). If you are sure that each ("ID", "VAR") pair has only one "VAL", you can use first():
from pyspark.sql import functions as f
result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL"))
result.show()
+----+------+-----+-----+
| ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast| Ham|
|ID_2| Jam|Toast| Egg|
+----+------+-----+-----+
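As a side note, pivot() also accepts the list of pivot values explicitly. If you know the distinct values of "VAR" up front, passing them saves Spark the extra job it would otherwise run just to discover them:

# Same result, but Spark skips computing the distinct values of "VAR"
result = df.groupBy("ID").pivot("VAR", ["VAR_1", "VAR_2", "VAR_3"]).agg(f.first("VAL"))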
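If an ("ID", "VAR") pair can instead hold several "VAL" rows, first() silently keeps only one of them. A minimal sketch of an alternative, assuming a comma-separated string per cell is acceptable for your data (the "," separator is an arbitrary choice):

from pyspark.sql import functions as f

# Collect every VAL per (ID, VAR) cell and join them into one string,
# instead of keeping an arbitrary single value
result = df.groupBy("ID").pivot("VAR").agg(f.concat_ws(",", f.collect_list("VAL")))
result.show()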