Pyspark - adding new column with values by using function - group by and max
I have a scenario where I have to take the result of a group by and max and create a new column:
For example, suppose I have this data:
+-------------+----------+-----+------+
|employee_name|department|state|salary|
+-------------+----------+-----+------+
|        James|     Sales|   NY| 90000|
|      Michael|     Sales|   NY| 86000|
|       Robert|     Sales|   CA| 81000|
|        Maria|   Finance|   CA| 90000|
|        Raman|   Finance|   CA| 99000|
|        Scott|   Finance|   NY| 83000|
|         Jeff| Marketing|   CA| 80000|
|        Kumar| Marketing|   NY| 91000|
+-------------+----------+-----+------+
My output should look like this:
+-------------+----------+-----+------+-------------------------+
|employee_name|department|state|salary|max(salary by department)|
+-------------+----------+-----+------+-------------------------+
|        James|     Sales|   NY| 90000|                    90000|
|      Michael|     Sales|   NY| 86000|                    90000|
|       Robert|     Sales|   CA| 81000|                    90000|
|        Maria|   Finance|   CA| 90000|                    99000|
|        Raman|   Finance|   CA| 99000|                    99000|
|        Scott|   Finance|   NY| 83000|                    99000|
|         Jeff| Marketing|   CA| 80000|                    91000|
|        Kumar| Marketing|   NY| 91000|                    91000|
+-------------+----------+-----+------+-------------------------+
Any suggestions would be of great help.
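For reference, a minimal sketch that builds the sample DataFrame used below (it assumes an active SparkSession; this setup is not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
data = [
    ("James", "Sales", "NY", 90000),
    ("Michael", "Sales", "NY", 86000),
    ("Robert", "Sales", "CA", 81000),
    ("Maria", "Finance", "CA", 90000),
    ("Raman", "Finance", "CA", 99000),
    ("Scott", "Finance", "NY", 83000),
    ("Jeff", "Marketing", "CA", 80000),
    ("Kumar", "Marketing", "NY", 91000),
]
df = spark.createDataFrame(data, ["employee_name", "department", "state", "salary"])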
import pyspark.sql.functions as F

# Aggregate the maximum salary per department, then join it back onto the original rows
dept_max = df.groupBy('department').agg(F.max('salary').alias('max_salary'))
result = df.join(dept_max, 'department')
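As a follow-up (not part of the original answer), the joined column can be used directly, for example to keep only the employees who earn their department's maximum:

# Rows where the employee earns the departmental maximum
top_earners = result.filter(F.col('salary') == F.col('max_salary'))
top_earners.show(truncate=False)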
You can also use a window partition instead of groupBy.
from pyspark.sql import Window
import pyspark.sql.functions as F

# Compute the departmental maximum as a window aggregate, without a separate join
df = df.withColumn('max_in_dept',
                   F.max('salary').over(Window.partitionBy('department')))
df.show(5, truncate=False)
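A small extension of the window approach (illustrative only; gap_to_max is a made-up column name, not from the original answer): the same partition spec can feed further per-department columns, e.g. how far each salary is from its department's maximum.

from pyspark.sql import Window
import pyspark.sql.functions as F

dept_window = Window.partitionBy('department')

# Reuse one window spec for several derived columns
df = (df
      .withColumn('max_in_dept', F.max('salary').over(dept_window))
      .withColumn('gap_to_max', F.col('max_in_dept') - F.col('salary')))
df.show(truncate=False)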