Grouping and summing columns while eliminating duplicates in PySpark

I have a DataFrame in PySpark that looks like this:
df = spark.createDataFrame(
    [
        ('14_100_00', 'A', 25, 0),
        ('14_100_00', 'A', 0, 24),
        ('15_100_00', 'A', 20, 1),
        ('150_100', 'C', 21, 0),
        ('16', 'A', 0, 20),
        ('16', 'A', 20, 0),
    ],
    ("rust", "name", "value_1", "value_2"),
)

df.show()
+---------+----+-------+-------+
|     rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00|   A|     25|      0|
|14_100_00|   A|      0|     24|
|15_100_00|   A|     20|      1|
|  150_100|   C|     21|      0|
|       16|   A|      0|     20|
|       16|   A|     20|      0|
+---------+----+-------+-------+

I am trying to update value_1 and value_2 based on the following conditions:

  1. When the rust and name columns are the same, value_1 should be the sum of value_1 for that group
  2. When the rust and name columns are the same, value_2 should be the sum of value_2 for that group

Expected output:

+---------+----+-------+-------+
|     rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00|   A|     25|     24|
|15_100_00|   A|     20|      1|
|  150_100|   C|     21|      0|
|       16|   A|     20|     20|
+---------+----+-------+-------+

This is what I tried:

from pyspark.sql import Window, functions as f

df1 = df.withColumn("VALUE_1", f.sum("VALUE_1").over(Window.partitionBy("rust", "name"))) \
        .withColumn("VALUE_2", f.sum("VALUE_2").over(Window.partitionBy("rust", "name")))
df1.show()
+---------+----+-------+-------+
|     rust|name|VALUE_1|VALUE_2|
+---------+----+-------+-------+
|  150_100|   C|     21|      0|
|       16|   A|     20|     20|
|       16|   A|     20|     20|
|14_100_00|   A|     25|     24|
|14_100_00|   A|     25|     24|
|15_100_00|   A|     20|      1|
+---------+----+-------+-------+

Is there a better way to do this without the duplicate rows?

Use groupBy instead of a window function:

from pyspark.sql import functions as F

df1 = df.groupBy("rust", "name").agg(
    F.sum("value_1").alias("value_1"),
    F.sum("value_2").alias("value_2"),
)
df1.show()
#+---------+----+-------+-------+
#|     rust|name|value_1|value_2|
#+---------+----+-------+-------+
#|14_100_00|   A|     25|     24|
#|15_100_00|   A|     20|      1|
#|  150_100|   C|     21|      0|
#|       16|   A|     20|     20|
#+---------+----+-------+-------+
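
If you prefer to keep the window-function approach, a minimal sketch (not part of the original answer) is to drop the duplicate rows after computing the window sums. The name df2 is just illustrative; since this DataFrame has only these four columns, every row within a (rust, name) group becomes identical after the sums, so dropDuplicates on the group keys keeps exactly one row per group:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("rust", "name")
df2 = (
    df.withColumn("value_1", F.sum("value_1").over(w))
      .withColumn("value_2", F.sum("value_2").over(w))
      .dropDuplicates(["rust", "name"])  # keep one row per (rust, name) group
)
df2.show()

That said, groupBy is simpler and generally lets Spark do partial (map-side) aggregation before the shuffle, so it is usually the better choice for this kind of per-group sum.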