pyspark column sum with transpose
My dataframe looks like -
+---+---+---+---+
| id| w1| w2| w3|
+---+---+---+---+
| 1|100|150|200|
| 2|200|400|500|
| 3|500|600|150|
+---+---+---+---+
I want the output to look like -
full total_amt
w1 800
w2 1150
w3 850
My code is -
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(1, 100, 150, 200), (2, 200, 400, 500), (3, 500, 600, 150)],
    ("id", "w1", "w2", "w3"))

res = df.unionAll(
    df.select([
        F.lit('All').alias('id'),
        F.sum(df.w1).alias('w1'),
        F.sum(df.w2).alias('w2'),
        F.sum(df.w3).alias('w3')
    ]))
res.show()
But the output gives me -
+---+---+----+---+
| id| w1| w2| w3|
+---+---+----+---+
| 1|100| 150|200|
| 2|200| 400|500|
| 3|500| 600|150|
|All|800|1150|850|
+---+---+----+---+
I think I need to create a pivot after the summation. All the fields are numeric in nature.
A quick solution could be
>>> df.createOrReplaceTempView('df')
>>> spark.sql('''
... select 'w1' as full, sum(w1) as total from df
... union
... select 'w2' as full, sum(w2) as total from df
... union
... select 'w3' as full, sum(w3) as total from df
... ''').show()
+----+-----+
|full|total|
+----+-----+
| w2| 1150|
| w3| 850|
| w1| 800|
+----+-----+
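The same union-of-sums can also be expressed with the DataFrame API instead of SQL. A minimal sketch, assuming the df defined above (the functools.reduce loop and the total_amt alias are my additions, not part of the original answer):

from functools import reduce
import pyspark.sql.functions as F

# Build one single-row DataFrame per column: ('w1', sum(w1)), ('w2', sum(w2)), ...
parts = [
    df.select(F.lit(c).alias('full'), F.sum(c).alias('total_amt'))
    for c in ['w1', 'w2', 'w3']
]

# Union the single-row frames into the long result
reduce(lambda a, b: a.union(b), parts).show()

Note that DataFrame union behaves like SQL UNION ALL (no deduplication), and, as with the SQL version, the row order of the result is not guaranteed.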
Try this approach -
First aggregate the data, then use the stack function to convert the columns to rows.
import pyspark.sql.functions as psf

# Perform the aggregation
df_agg = df.agg(psf.sum('w1').alias('w1'), psf.sum('w2').alias('w2'), psf.sum('w3').alias('w3'))

# Let's have a look at the aggregated dataframe
df_agg.show()
#+---+----+---+
#| w1| w2| w3|
#+---+----+---+
#|800|1150|850|
#+---+----+---+
# Use the stack function to convert the columns to rows
df_agg.selectExpr("stack(3, 'w1', w1, 'w2', w2, 'w3', w3) as (full, total)").show()
#+----+-----+
#|full|total|
#+----+-----+
#| w1| 800|
#| w2| 1150|
#| w3| 850|
#+----+-----+
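If there are more than a handful of w columns, the stack expression does not have to be written out by hand. A minimal sketch, assuming every column other than id should be summed and unpivoted (the value_cols list and the stack_expr string are my own helpers, not from the original answer):

import pyspark.sql.functions as psf

value_cols = [c for c in df.columns if c != 'id']

# Aggregate every value column in one pass
df_agg = df.agg(*[psf.sum(c).alias(c) for c in value_cols])

# Build "stack(3, 'w1', w1, 'w2', w2, 'w3', w3)" from the column list
stack_expr = "stack({}, {}) as (full, total)".format(
    len(value_cols),
    ", ".join("'{0}', {0}".format(c) for c in value_cols))

df_agg.selectExpr(stack_expr).show()

On Spark 3.4 or later, DataFrame.unpivot (also aliased as melt) offers the same reshaping without building the expression string.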