Grouping and sum of columns and eliminate duplicates in PySpark
I have a dataframe in pyspark that looks like this:
df = spark.createDataFrame(
    [
        ('14_100_00', 'A', 25, 0),
        ('14_100_00', 'A', 0, 24),
        ('15_100_00', 'A', 20, 1),
        ('150_100', 'C', 21, 0),
        ('16', 'A', 0, 20),
        ('16', 'A', 20, 0)
    ],
    ("rust", "name", "value_1", "value_2")
)
df.show()
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 0|
|14_100_00| A| 0| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 0| 20|
| 16| A| 20| 0|
+---------+----+-------+-------+
I am trying to update the value_1 and value_2 columns based on the following conditions:
- when the rust and name columns are the same, value_1 becomes the sum of value_1 for that group
- when the rust and name columns are the same, value_2 becomes the sum of value_2 for that group
Expected result:
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 20| 20|
+---------+----+-------+-------+
I tried this:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

df1 = df.withColumn("VALUE_1", f.sum("VALUE_1").over(Window.partitionBy("rust", "name"))) \
        .withColumn("VALUE_2", f.sum("VALUE_2").over(Window.partitionBy("rust", "name")))
df1.show()
+---------+----+-------+-------+
| rust|name|VALUE_1|VALUE_2|
+---------+----+-------+-------+
| 150_100| C| 21| 0|
| 16| A| 20| 20|
| 16| A| 20| 20|
|14_100_00| A| 25| 24|
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
+---------+----+-------+-------+
Is there a better way to achieve this without the duplicate rows?
Use groupBy instead of a window function:
from pyspark.sql import functions as F

df1 = df.groupBy("rust", "name").agg(
    F.sum("value_1").alias("value_1"),
    F.sum("value_2").alias("value_2"),
)
df1.show()
#+---------+----+-------+-------+
#| rust|name|value_1|value_2|
#+---------+----+-------+-------+
#|14_100_00| A| 25| 24|
#|15_100_00| A| 20| 1|
#| 150_100| C| 21| 0|
#| 16| A| 20| 20|
#+---------+----+-------+-------+
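If you want to keep the window-function approach, you can also drop the duplicated rows afterwards with dropDuplicates. A minimal sketch, assuming the same df and spark session as in the question (w and df2 are just illustrative names):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sum value_1 and value_2 within each (rust, name) partition,
# then keep a single row per group.
w = Window.partitionBy("rust", "name")
df2 = (
    df.withColumn("value_1", F.sum("value_1").over(w))
      .withColumn("value_2", F.sum("value_2").over(w))
      .dropDuplicates(["rust", "name"])
)
df2.show()
The groupBy version is usually the simpler and cheaper choice here, since the window version computes the sums for every input row before the duplicates are discarded.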