Calculate the sum and average of a column in a PySpark DataFrame and add new rows for the calculated values
I have a PySpark DataFrame:
Place   Month      Sector    Estimate   Profit
USA     1/1/2020   Sector1   5944
Col     1/1/2020   Sector1   398
IND     1/1/2020   Sector1   25
USA     1/1/2020   Sector2              6.9%
Col     1/1/2020   Sector2              0.4%
China   1/1/2020   Sector2              0.0%
Aus     1/1/2020   Sector2              7.7%
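I need to calculate the sum of the Estimate column (including all values) and the average of the Profit column (excluding 0.0%), grouped by Month and Sector.
I then need to add rows whose Place value is Every Places to hold these sums and averages. So, my desired DataFrame should look like this: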
Place          Month      Sector    Estimate   Profit
USA            1/1/2020   Sector1   5944
Col            1/1/2020   Sector1   398
IND            1/1/2020   Sector1   25
USA            1/1/2020   Sector2              6.9%
Col            1/1/2020   Sector2              0.4%
China          1/1/2020   Sector2              0.0%
Aus            1/1/2020   Sector2              7.7%
Every Places   1/1/2020   Sector1   6367
Every Places   1/1/2020   Sector2              5%
I tried the code below, but it fails with TypeError: Column is not iterable:
df1=df.withColumn('Place',lit('Every Places')) \
.groupBy('Month','Sector') \
.sum((col('Estimate'))),
avg(F.col('Profit'))
How can I solve this?
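You can first group by Month + Sector to compute the sum of Estimate and the average of Profit, then union the result with the original DataFrame to get the expected output. (The TypeError comes from passing a Column object to GroupedData.sum(), which expects column names as strings; using agg() with F.sum() avoids that.)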
import pyspark.sql.functions as F

# assumes an active SparkSession named `spark`
df = spark.createDataFrame(
    [
        ("USA", "1/1/2020", "Sector1", 5944, None),
        ("Col", "1/1/2020", "Sector1", 398, None),
        ("IND", "1/1/2020", "Sector1", 25, None),
        ("USA", "1/1/2020", "Sector2", None, "6.9%"),
        ("Col", "1/1/2020", "Sector2", None, "0.4%"),
        ("China", "1/1/2020", "Sector2", None, "0.0%"),
        ("Aus", "1/1/2020", "Sector2", None, "7.7%"),
    ],
    ["Place", "Month", "Sector", "Estimate", "Profit"],
)

grouped_df = df.withColumn(
    "Profit",
    F.regexp_extract("Profit", "(.+)%", 1)  # extract percentage from string
).groupBy("Month", "Sector").agg(
    F.sum(F.col("Estimate")).alias("Estimate"),
    F.concat(
        F.sum("Profit") / F.sum(F.when(F.col("Profit") > 0.0, 1)),  # exclude 0% from calculation
        F.lit("%")
    ).alias("Profit")
).withColumn(
    "Place",
    F.lit("Every Places")
)

# append the aggregated rows to the original rows, matching columns by name
df1 = df.unionByName(grouped_df)
df1.show()
#+------------+--------+-------+--------+------+
#|       Place|   Month| Sector|Estimate|Profit|
#+------------+--------+-------+--------+------+
#|         USA|1/1/2020|Sector1|    5944|  null|
#|         Col|1/1/2020|Sector1|     398|  null|
#|         IND|1/1/2020|Sector1|      25|  null|
#|         USA|1/1/2020|Sector2|    null|  6.9%|
#|         Col|1/1/2020|Sector2|    null|  0.4%|
#|       China|1/1/2020|Sector2|    null|  0.0%|
#|         Aus|1/1/2020|Sector2|    null|  7.7%|
#|Every Places|1/1/2020|Sector2|    null|  5.0%|
#|Every Places|1/1/2020|Sector1|  6367.0|  null|
#+------------+--------+-------+--------+------+
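To sanity-check the expected numbers without a Spark session, the same grouping logic can be sketched in plain Python. This is only an illustration of the arithmetic, not a replacement for the Spark code: the summarize helper and the rows list are hypothetical names introduced here, and the rounding to two decimals is an assumption made to keep floating-point noise out of the result.

```python
from collections import defaultdict

# Same sample data as the DataFrame above (hypothetical plain-Python copy).
rows = [
    ("USA", "1/1/2020", "Sector1", 5944, None),
    ("Col", "1/1/2020", "Sector1", 398, None),
    ("IND", "1/1/2020", "Sector1", 25, None),
    ("USA", "1/1/2020", "Sector2", None, "6.9%"),
    ("Col", "1/1/2020", "Sector2", None, "0.4%"),
    ("China", "1/1/2020", "Sector2", None, "0.0%"),
    ("Aus", "1/1/2020", "Sector2", None, "7.7%"),
]


def summarize(rows):
    """Return {(month, sector): (estimate_sum, profit_avg)}.

    Estimate is summed over all non-null values; Profit is averaged
    over non-null values that are not 0.0%.
    """
    estimate_totals = defaultdict(int)
    profit_values = defaultdict(list)
    for _place, month, sector, estimate, profit in rows:
        key = (month, sector)
        if estimate is not None:
            estimate_totals[key] += estimate
        if profit is not None:
            value = float(profit.rstrip("%"))  # "6.9%" -> 6.9
            if value != 0.0:  # leave 0.0% out of the average
                profit_values[key].append(value)

    summary = {}
    for key in sorted(set(estimate_totals) | set(profit_values)):
        values = profit_values.get(key)
        average = round(sum(values) / len(values), 2) if values else None
        summary[key] = (estimate_totals.get(key), average)
    return summary


summary = summarize(rows)
for (month, sector), (estimate_sum, profit_avg) in summary.items():
    print("Every Places", month, sector, estimate_sum, profit_avg)
```

For the sample data this gives a total of 6367 for Sector1 and an average of 5.0 for Sector2, matching the two Every Places rows in the output above.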