Row-wise sum per group, adding the total as a new row in a PySpark DataFrame
I have a dataframe like this example:
df = spark.createDataFrame(
    [(2, "A", "A2", 2500),
     (2, "A", "A11", 3500),
     (2, "A", "A12", 5500),
     (4, "B", "B25", 7600),
     (4, "B", "B26", 5600),
     (5, "C", "c25", 2658),
     (5, "C", "c27", 1100),
     (5, "C", "c28", 1200)],
    ["parent", "group", "brand", "usage"])
Output:
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A2| 2500|
| 2| A| A11| 3500|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 5| C| c25| 2658|
| 5| C| c27| 1100|
| 5| C| c28| 1200|
+------+-----+-----+-----+
What I want to do is compute the total usage per group and append it as a new row whose brand value is "Total". How can I do this in PySpark?
Expected result:
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A2| 2500|
| 2| A| A11| 3500|
| 2| A|Total| 6000|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 4| B|Total|18700|
| 5| C| c25| 2658|
| 5| C| c27| 1100|
| 5| C| c28| 1200|
| 5| C|Total| 4958|
+------+-----+-----+-----+
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [(2, "A", "A2", 2500),
     (2, "A", "A11", 3500),
     (2, "A", "A12", 5500),
     (4, "B", "B25", 7600),
     (4, "B", "B26", 5600),
     (5, "C", "c25", 2658),
     (5, "C", "c27", 1100),
     (5, "C", "c28", 1200)],
    ["parent", "group", "brand", "usage"])
df.show()
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A2| 2500|
| 2| A| A11| 3500|
| 2| A| A12| 5500|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 5| C| c25| 2658|
| 5| C| c27| 1100|
| 5| C| c28| 1200|
+------+-----+-----+-----+
# Group by and sum to get the per-group totals, with brand set to "Total"
totals = df.groupBy(['group', 'parent']).agg(F.sum('usage').alias('usage')).withColumn('brand', F.lit('Total'))
# Create a temporary sort key so the total rows land after the detail rows
totals = totals.withColumn('sort_id', F.lit(2))
df = df.withColumn('sort_id', F.lit(1))
# Union the DataFrames, sort, drop the temporary key, and show
df.unionByName(totals).sort(['group', 'sort_id']).drop('sort_id').show()
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A12| 5500|
| 2| A| A11| 3500|
| 2| A| A2| 2500|
| 2| A|Total|11500|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 4| B|Total|13200|
| 5| C| c25| 2658|
| 5| C| c28| 1200|
| 5| C| c27| 1100|
| 5| C|Total| 4958|
+------+-----+-----+-----+