Calculate sum and average of a column in a pyspark dataframe and create a new row for the calculated values

Question

I have a pyspark dataframe:

Place   Month       Sector      Estimate    Profit  
USA     1/1/2020    Sector1     5944
Col     1/1/2020    Sector1     398
IND     1/1/2020    Sector1     25
USA     1/1/2020    Sector2                 6.9%
Col     1/1/2020    Sector2                 0.4%
China   1/1/2020    Sector2                 0.0%
Aus     1/1/2020    Sector2                 7.7%

I need to calculate the sum of the Estimate column (including every value) and the average of the Profit column (excluding 0.0%), grouped by Month and Sector.

I need to add an extra value in the Place field, Every Places, to hold these sums and averages. So the dataframe I want should look like this:

Place           Month       Sector      Estimate    Profit  
USA             1/1/2020    Sector1     5944
Col             1/1/2020    Sector1     398
IND             1/1/2020    Sector1     25
USA             1/1/2020    Sector2                 6.9%
Col             1/1/2020    Sector2                 0.4%
China           1/1/2020    Sector2                 0.0%
Aus             1/1/2020    Sector2                 7.7%
Every Places    1/1/2020    Sector1     6367
Every Places    1/1/2020    Sector2                 5%
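
As a quick sanity check on the two Every Places rows, the arithmetic can be reproduced in plain Python. This is only a sketch of the intended calculation; the variable names are illustrative and not part of the original post.

```python
# Values copied from the sample dataframe above.
estimates = [5944, 398, 25]                  # Sector1 Estimate values
profits = ["6.9%", "0.4%", "0.0%", "7.7%"]   # Sector2 Profit values

# Sum of Estimate includes every value.
estimate_total = sum(estimates)

# Average of Profit skips the 0.0% entries entirely,
# so the divisor is 3 rather than 4.
profit_values = [float(p.rstrip("%")) for p in profits]
non_zero = [p for p in profit_values if p > 0.0]
profit_avg = round(sum(non_zero) / len(non_zero), 1)

print(estimate_total)      # total for Sector1
print(f"{profit_avg}%")    # average for Sector2, zeros excluded
```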

I tried the code below, but I get a `TypeError: Column is not iterable` error:

df1=df.withColumn('Place',lit('Every Places')) \
               .groupBy('Month','Sector') \
               .sum((col('Estimate'))),
               avg(F.col('Profit'))

How can I fix this?

Answer 1

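You can first group by Month and Sector to compute the sum of Estimate and the average of Profit, then union the result with the original dataframe to get the expected output: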

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("USA", "1/1/2020", "Sector1", 5944, None), ("Col", "1/1/2020", "Sector1", 398, None),
    ("IND", "1/1/2020", "Sector1", 25, None), ("USA", "1/1/2020", "Sector2", None, "6.9%"),
    ("Col", "1/1/2020", "Sector2", None, "0.4%"), ("China", "1/1/2020", "Sector2", None, "0.0%"),
    ("Aus", "1/1/2020", "Sector2", None, "7.7%")], ["Place", "Month", "Sector", "Estimate", "Profit"]
)

grouped_df = df.withColumn(
    "Profit",
    F.regexp_extract("Profit", "(.+)%", 1) # extract percentage from string
).groupBy("Month", "Sector").agg(
    F.sum(F.col("Estimate")).alias("Estimate"),
    F.concat(
        F.sum("Profit") / F.sum(F.when(F.col("Profit") > 0.0, 1)), # exclude 0% from calculation
        F.lit("%")
    ).alias("Profit")
).withColumn(
    "Place",
    F.lit("Every Places")
)

df1 = df.unionByName(grouped_df)

df1.show()
#+------------+--------+-------+--------+------+
#|       Place|   Month| Sector|Estimate|Profit|
#+------------+--------+-------+--------+------+
#|         USA|1/1/2020|Sector1|    5944|  null|
#|         Col|1/1/2020|Sector1|     398|  null|
#|         IND|1/1/2020|Sector1|      25|  null|
#|         USA|1/1/2020|Sector2|    null|  6.9%|
#|         Col|1/1/2020|Sector2|    null|  0.4%|
#|       China|1/1/2020|Sector2|    null|  0.0%|
#|         Aus|1/1/2020|Sector2|    null|  7.7%|
#|Every Places|1/1/2020|Sector2|    null|  5.0%|
#|Every Places|1/1/2020|Sector1|  6367.0|  null|
#+------------+--------+-------+--------+------+
