Pyspark - adding new column with values by using function - group by and max

I have a scenario where I need to take the result of a group by with max and add it to each row as a new column:

For example, suppose I have this data:

+-------------+----------+-----+------+
|employee_name|department|state|salary|
+-------------+----------+-----+------+
|        James|     Sales|   NY| 90000|
|      Michael|     Sales|   NY| 86000|
|       Robert|     Sales|   CA| 81000|
|        Maria|   Finance|   CA| 90000|
|        Raman|   Finance|   CA| 99000|
|        Scott|   Finance|   NY| 83000|
|         Jeff| Marketing|   CA| 80000|
|        Kumar| Marketing|   NY| 91000|
+-------------+----------+-----+------+

My output should look like this:

+-------------+----------+-----+------+----------+
|employee_name|department|state|salary|max_salary|
+-------------+----------+-----+------+----------+
|        James|     Sales|   NY| 90000|     90000|
|      Michael|     Sales|   NY| 86000|     90000|
|       Robert|     Sales|   CA| 81000|     90000|
|        Maria|   Finance|   CA| 90000|     99000|
|        Raman|   Finance|   CA| 99000|     99000|
|        Scott|   Finance|   NY| 83000|     99000|
|         Jeff| Marketing|   CA| 80000|     91000|
|        Kumar| Marketing|   NY| 91000|     91000|
+-------------+----------+-----+------+----------+

Any suggestions? They would be a great help.

    import pyspark.sql.functions as F

    # compute the per-department max once, then join it back onto every row
    max_df = df.groupBy('department').agg(F.max('salary').alias('max_salary'))
    result = df.join(max_df, 'department')

You can also use a window partition instead of a group by and join:

    from pyspark.sql.window import Window

    # the window computes max(salary) over each department partition,
    # so no separate aggregation or join is needed
    df = df.withColumn('max_in_dept',
                       F.max('salary').over(Window.partitionBy('department')))
    df.show(5, False)
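If Spark is not handy, the logic both snippets implement (compute each group's max, then broadcast it back to every row of that group) can be sanity-checked in plain Python. This is only an illustrative sketch of the transformation, not Spark code; the sample rows mirror the table in the question:

```python
from collections import defaultdict

# sample rows mirroring the question's data: (name, department, salary)
rows = [
    ("James", "Sales", 90000), ("Michael", "Sales", 86000),
    ("Robert", "Sales", 81000), ("Maria", "Finance", 90000),
    ("Raman", "Finance", 99000), ("Scott", "Finance", 83000),
    ("Jeff", "Marketing", 80000), ("Kumar", "Marketing", 91000),
]

# first pass: per-department max (what groupBy().agg(F.max) computes)
dept_max = defaultdict(int)
for _, dept, salary in rows:
    dept_max[dept] = max(dept_max[dept], salary)

# second pass: attach the group max to every row
# (what the join, or the window function, does)
result = [(name, dept, salary, dept_max[dept]) for name, dept, salary in rows]

for row in result:
    print(row)
```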