PySpark DataFrame groupby into list of values?

Simply put, suppose I have the following DataFrame:

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

How can I group by department and collect all the other values into lists, like this:

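+----------+-------------------------------------+------------------------------+
|department|employee_name                        |salary                        |
+----------+-------------------------------------+------------------------------+
|Sales     |[James, Michael, Robert, James, Saif]|[3000, 4600, 4100, 3000, 4100]|
|Finance   |[Maria, Scott, Jen]                  |[3000, 3300, 3900]            |
|Marketing |[Jeff, Kumar]                        |[3000, 2000]                  |
+----------+-------------------------------------+------------------------------+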

Use collect_list together with a groupBy clause:

from pyspark.sql.functions import col, collect_list

# Collect each remaining column into one list per department
df.groupBy(col("department")).agg(
    collect_list(col("employee_name")).alias("employee_name"),
    collect_list(col("salary")).alias("salary"),
)

Let's try it with the least typing:

from pyspark.sql.functions import collect_list

df.groupby('department').agg(*[collect_list(c).alias(c) for c in df.drop('department').columns]).show()
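
To check the expected lists without a Spark session, the grouping can be modeled in plain Python. This is only a sketch of the semantics, not how Spark executes the aggregation, and `group_into_lists` is a hypothetical helper name introduced here for illustration. The rows are copied from the question.

```python
from collections import defaultdict

# Rows copied from the question: (employee_name, department, salary)
rows = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]


def group_into_lists(rows):
    """Hypothetical helper: plain-Python model of groupBy + collect_list."""
    grouped = defaultdict(lambda: {"employee_name": [], "salary": []})
    for name, dept, salary in rows:
        # Append in input order; duplicates are kept, as with collect_list
        grouped[dept]["employee_name"].append(name)
        grouped[dept]["salary"].append(salary)
    return dict(grouped)


result = group_into_lists(rows)
for dept, cols in result.items():
    print(dept, cols["employee_name"], cols["salary"])
```

The plain-Python version keeps input order, but in Spark the order of elements produced by collect_list is not guaranteed after a shuffle, so compare sorted lists when checking a real job against this model. Duplicates such as the second James row are kept; collect_set would drop them.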