PySpark

Question

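I'm trying to chain join and groupby operations together. The inputs and the operation I want to perform are shown below. I want to group by all columns except the one used in agg. Is there a way to do this without listing every column name, as in groupby("colA","colB")? I tried groupby(df1.*), but that didn't work. In this case, I know that I want to group by all the columns in df1. Many thanks.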

Input1:
colA |  ColB
--------------
 A   | 100
 B   | 200

Input2:
colAA |  ColBB
--------------
 A   | Group1
 B   | Group2
 A   | Group2

df1.join(df2, df1.colA==df2.colAA,"left").drop("colAA").groupby("colA","colB").agg(collect_set("colBB"))
 # Is there a way to avoid listing ("colA","colB") in groupby? There will be many columns.

Output:
 colA |  ColB | collect_set
--------------
 A   | 100    | (Group1,Group2)
 B   | 200    | (Group2)

Answer 1

Simple:

.groupby(df1.columns)
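
This works because `DataFrame.columns` is a plain Python list of column-name strings. When the columns to group by don't coincide exactly with df1, the same list can be filtered to exclude the aggregated column instead. A minimal sketch, assuming the column names from the question; `grouping_columns` and `joined` are hypothetical names used for illustration, not part of PySpark:

```python
def grouping_columns(all_columns, aggregated):
    """Return all_columns minus the aggregated ones, preserving order."""
    excluded = set(aggregated)
    return [c for c in all_columns if c not in excluded]


# Column names as they would look after the join and drop("colAA").
joined_columns = ["colA", "ColB", "colBB"]
group_cols = grouping_columns(joined_columns, ["colBB"])
print(group_cols)  # ['colA', 'ColB']

# With a real DataFrame (hypothetical variable `joined`), this could be used as:
# joined.groupby(*grouping_columns(joined.columns, ["colBB"])).agg(collect_set("colBB"))
```

The helper only manipulates names, so it can be checked without a Spark session.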

Answer 2

Based on your clarifying comment, use df1.columns:

 df1.join(df2, df1.colA==df2.colAA,"left").drop("colAA").groupby(df1.columns).agg(collect_set("colBB").alias('new')).show()
+----+----+----------------+
|colA|ColB|             new|
+----+----+----------------+
|   A| 100|[Group2, Group1]|
|   B| 200|        [Group2]|
+----+----+----------------+

PySpark - how to select all columns to be used in groupby

python

dataframe

apache-spark-sql