PySpark DataFrame groupby into list of values?
Simply put, suppose I have the following DataFrame:
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| James| Sales| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff| Marketing| 3000|
| Kumar| Marketing| 2000|
| Saif| Sales| 4100|
+-------------+----------+------+
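(For reproducibility, here is a minimal sketch that builds this DataFrame; it assumes an existing SparkSession bound to `spark`:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows taken directly from the table above
data = [
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100),
]
df = spark.createDataFrame(data, ["employee_name", "department", "salary"])
```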
How can I group by department and collect all the other values into lists, like this:
| department | employee_name                         | salary                         |
|------------|---------------------------------------|--------------------------------|
| Sales      | [James, Michael, Robert, James, Saif] | [3000, 4600, 4100, 3000, 4100] |
| Finance    | [Maria, Scott, Jen]                   | [3000, 3300, 3900]             |
| Marketing  | [Jeff, Kumar]                         | [3000, 2000]                   |
Use collect_list with the groupBy clause:
```python
from pyspark.sql.functions import col, collect_list

df.groupBy(col("department")).agg(
    collect_list(col("employee_name")).alias("employee_name"),
    collect_list(col("salary")).alias("salary"),
)
```
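One caveat: the two collect_list aggregations run independently, and Spark does not guarantee that the name list and the salary list end up in matching order. If the pairing matters, one workaround (a sketch, not part of the original answer) is to collect structs and unpack them afterwards:

```python
from pyspark.sql import functions as F

grouped = (
    df.groupBy("department")
      .agg(F.collect_list(F.struct("employee_name", "salary")).alias("rows"))
)
# Each struct keeps a name paired with its salary; unpack the array of
# structs back into two aligned array columns:
result = grouped.select(
    "department",
    F.col("rows.employee_name").alias("employee_name"),
    F.col("rows.salary").alias("salary"),
)
```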
Let's try it with minimal typing:
```python
from pyspark.sql.functions import collect_list

df.groupby("department").agg(
    *[collect_list(c).alias(c) for c in df.drop("department").columns]
).show()
```
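This version generalizes automatically: every column other than department is collected into its own list, so it still works if more columns are added. If duplicates should be dropped instead (James/3000 appears twice under Sales), collect_set can be swapped in for collect_list; a minimal sketch:

```python
from pyspark.sql.functions import collect_set

# collect_set deduplicates within each group (element order is not defined)
df.groupby("department").agg(
    *[collect_set(c).alias(c) for c in df.drop("department").columns]
).show(truncate=False)
```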