Align multiple DataFrames in PySpark

I have these 4 Spark DataFrames:

order,device,count_1
101,201,2
102,202,4

order,device,count_2
101,201,10
103,203,100

order,device,count_3
104,204,111
103,203,10

order,device,count_4
101,201,4
104,204,11

I want to create a single result DataFrame:

order,device,count_1,count_2,count_3,count_4
101,201,2,10,,4
102,202,4,,,
103,203,,100,10,
104,204,,,111,11

Is this a case for UNION, JOIN, or APPEND? How do I get the final df?

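You can think of UNION as combining tables by rows, so the number of rows will probably increase. JOIN combines tables by columns. I'm not sure what you mean by APPEND, but in this case you need a JOIN.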
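To make the row-wise vs. column-wise distinction concrete, here is a minimal plain-Python sketch (no Spark involved) using the first two tables from the question. The helper names `union_rows` and `full_outer_join` are made up for illustration, and the join helper assumes each key appears at most once per table, which holds for the sample data but is not what a real join does with duplicate keys.

```python
# Toy tables as lists of dicts; plain Python, no Spark needed.
t1 = [{"order": 101, "device": 201, "count_1": 2},
      {"order": 102, "device": 202, "count_1": 4}]
t2 = [{"order": 101, "device": 201, "count_2": 10},
      {"order": 103, "device": 203, "count_2": 100}]


def union_rows(a, b):
    """Row-wise combination: stack the rows of b under the rows of a."""
    return a + b


def full_outer_join(a, b, keys):
    """Column-wise combination: one output row per distinct key.

    Assumes each key occurs at most once per table (hypothetical helper).
    """
    merged = {}
    for row in a + b:
        k = tuple(row[c] for c in keys)
        merged.setdefault(k, {}).update(row)
    # Collect every column name in first-seen order.
    cols = []
    for row in a + b:
        for c in row:
            if c not in cols:
                cols.append(c)
    # Columns a key never received are filled with None (Spark shows null).
    return [{c: r.get(c) for c in cols} for _, r in sorted(merged.items())]


stacked = union_rows(t1, t2)
joined = full_outer_join(t1, t2, ["order", "device"])

print(len(stacked))  # 4 rows: the union only grows downwards
for r in joined:
    print(r)         # 3 rows, each carrying both count_1 and count_2
```

The union leaves you with four rows and mismatched count columns, while the full outer join gives one row per (order, device) pair, which is the shape asked for in the question.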

Try:

// Outside spark-shell, toDF needs the implicits of your SparkSession
// (assumed here to be available as `spark`).
import spark.implicits._

val df1 = Seq((101, 201, 2), (102, 202, 4)).toDF("order", "device", "count_1")
val df2 = Seq((101, 201, 10), (103, 203, 100)).toDF("order", "device", "count_2")
val df3 = Seq((104, 204, 111), (103, 203, 10)).toDF("order", "device", "count_3")
val df4 = Seq((101, 201, 4), (104, 204, 11)).toDF("order", "device", "count_4")

// A full outer join on the shared keys keeps every (order, device) pair;
// counts missing on either side become null.
val keys = Seq("order", "device")
val df12 = df1.join(df2, keys, "fullouter")
df12.show(false)
val df123 = df12.join(df3, keys, "fullouter")
df123.show(false)
val df1234 = df123.join(df4, keys, "fullouter")
df1234.show(false)

returns:

+-----+------+-------+-------+-------+-------+
|order|device|count_1|count_2|count_3|count_4|
+-----+------+-------+-------+-------+-------+
|101  |201   |2      |10     |null   |4      |
|102  |202   |4      |null   |null   |null   |
|103  |203   |null   |100    |10     |null   |
|104  |204   |null   |null   |111    |11     |
+-----+------+-------+-------+-------+-------+

As you can see, the comments are flawed and the first answer is incorrect.

I did this in Scala, but it should be easy to do the same in pyspark.