How to sum multiple row values with groupby using pyspark?
Given below is a pyspark dataframe, and I need to sum the row values with groupby:

+-------------------+---------+------------------------------+------------------------------+------------------------------+
|            load_dt|org_cntry|sum(srv_curr_vo_qty_accs_mthd)|sum(srv_curr_bb_qty_accs_mthd)|sum(srv_curr_tv_qty_accs_mthd)|
+-------------------+---------+------------------------------+------------------------------+------------------------------+
|2021-12-06 00:00:00|     null|                           NaN|                           NaN|                           NaN|
|2021-12-06 00:00:00|   PANAMA|                      360126.0|                      214229.0|                      207950.0|
+-------------------+---------+------------------------------+------------------------------+------------------------------+
Conditions:
1. groupby(load_dt, org_cntry)
2. sum the row values: sum(srv_curr_vo_qty_accs_mthd), sum(srv_curr_bb_qty_accs_mthd), sum(srv_curr_tv_qty_accs_mthd)

Expected output:
load_dt      org_cntry  total_sum
2021-12-06   Panama     782305
Simply add (+) your sums:
from pyspark.sql import functions as F

df.groupBy("load_dt", "org_cntry").agg(
    (
        F.sum("srv_curr_vo_qty_accs_mthd")
        + F.sum("srv_curr_bb_qty_accs_mthd")
        + F.sum("srv_curr_tv_qty_accs_mthd")
    ).alias("total_sum")
)
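For reference, a minimal, self-contained sketch of that aggregation run on the PANAMA row from the question (assuming an active SparkSession named spark; the column names and values are the ones shown above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# sample data copied from the question (PANAMA row only, load_dt kept as a string)
df = spark.createDataFrame(
    [("2021-12-06 00:00:00", "PANAMA", 360126.0, 214229.0, 207950.0)],
    ["load_dt", "org_cntry",
     "srv_curr_vo_qty_accs_mthd",
     "srv_curr_bb_qty_accs_mthd",
     "srv_curr_tv_qty_accs_mthd"],
)

# group by date and country, then add the three per-group sums
df.groupBy("load_dt", "org_cntry").agg(
    (
        F.sum("srv_curr_vo_qty_accs_mthd")
        + F.sum("srv_curr_bb_qty_accs_mthd")
        + F.sum("srv_curr_tv_qty_accs_mthd")
    ).alias("total_sum")
).show()
#+-------------------+---------+---------+
#|            load_dt|org_cntry|total_sum|
#+-------------------+---------+---------+
#|2021-12-06 00:00:00|   PANAMA| 782305.0|
#+-------------------+---------+---------+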
For this case, use Spark 2.4+ higher-order functions.
Example:
#sample dataframe
#+-------------------+---------+--------+--------+--------+
#| load_dt|org_cntry| s1| s2| s3|
#+-------------------+---------+--------+--------+--------+
#|2021-12-06 00:00:00| PANAMA|360126.0|214229.0|207950.0|
#+-------------------+---------+--------+--------+--------+
#create array from sum columns then add all the array elements.
df.selectExpr("*", "AGGREGATE(array(s1,s2,s3), cast(0 as double), (x, y) -> x + y) total_sum").show()
#using withColumn (expr needs to be imported)
from pyspark.sql.functions import expr
df.withColumn("total_sum", expr("AGGREGATE(array(s1,s2,s3), cast(0 as double), (x, y) -> x + y)")).show()
#+-------------------+---------+--------+--------+--------+---------+
#| load_dt|org_cntry| s1| s2| s3|total_sum|
#+-------------------+---------+--------+--------+--------+---------+
#|2021-12-06 00:00:00| PANAMA|360126.0|214229.0|207950.0| 782305.0|
#+-------------------+---------+--------+--------+--------+---------+
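For completeness, a sketch of how the sample dataframe above could be recreated so the two lines run as-is (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# recreate the sample dataframe used in this answer
df = spark.createDataFrame(
    [("2021-12-06 00:00:00", "PANAMA", 360126.0, 214229.0, 207950.0)],
    ["load_dt", "org_cntry", "s1", "s2", "s3"],
)

# same higher-order AGGREGATE expression as above: build an array from the
# three columns and fold it with addition, starting from a double zero
df.withColumn(
    "total_sum",
    expr("AGGREGATE(array(s1,s2,s3), cast(0 as double), (x, y) -> x + y)"),
).show()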