Pyspark - Apply groupBy aggregations on different group levels
I have the following pyspark dataframe:
+---------+-----+
| Day|Sunny|
+---------+-----+
| Sunday| Yes|
| Sunday| No|
| Monday| Yes|
| Monday| No|
| Tuesday| Yes|
| Tuesday| Yes|
| Tuesday| No|
| Tuesday| No|
| Tuesday| No|
|Wednesday| Yes|
|Wednesday| Yes|
|Wednesday| Yes|
|Wednesday| Yes|
|Wednesday| No|
| Thursday| Yes|
| Thursday| Yes|
| Thursday| No|
| Thursday| No|
| Friday| No|
| Friday| No|
| Friday| Yes|
| Saturday| Yes|
| Saturday| Yes|
| Saturday| No|
+---------+-----+
I grouped by the two columns Day and Sunny and then counted, to get the count for each ("Day", "Sunny") pair:
+---------+-----+-----+
| Day|Sunny|count|
+---------+-----+-----+
| Friday| No| 2|
| Friday| Yes| 1|
| Monday| Yes| 1|
| Monday| No| 1|
| Saturday| No| 1|
| Saturday| Yes| 2|
| Sunday| No| 1|
| Sunday| Yes| 1|
| Thursday| Yes| 2|
| Thursday| No| 2|
| Tuesday| No| 3|
| Tuesday| Yes| 2|
|Wednesday| Yes| 4|
|Wednesday| No| 1|
+---------+-----+-----+
My question is how to get the following dataframe, i.e. add a column giving, within each Day group, the percentage that each Sunny value's count represents:
+---------+-----+-----+---------------------+
| Day|Sunny|count| |
+---------+-----+-----+---------------------+
| Friday| No| 2| 66% of Friday |
| Friday| Yes| 1| 33% of Friday |
| Monday| Yes| 1| 50% of Monday |
| Monday| No| 1| 50% of Monday |
| Saturday| No| 1| 33% of Saturday |
| Saturday| Yes| 2| 66% of Saturday |
| Sunday| No| 1| 50% of Sunday |
| Sunday| Yes| 1| 50% of Sunday |
| Thursday| Yes| 2| 50% of Thursday |
| Thursday| No| 2| 50% of Thursday |
| Tuesday| No| 3| 60% of Tuesday |
| Tuesday| Yes| 2| 40% of Tuesday |
|Wednesday| Yes| 4| 80% of Wednesday |
|Wednesday| No| 1| 20% of Wednesday |
+---------+-----+-----+---------------------+
The code to create the dataframe is:
df = spark.createDataFrame(
    [
        ("Sunday", "Yes"),
        ("Sunday", "No"),
        ("Monday", "Yes"),
        ("Monday", "No"),
        ("Tuesday", "Yes"),
        ("Tuesday", "Yes"),
        ("Tuesday", "No"),
        ("Tuesday", "No"),
        ("Tuesday", "No"),
        ("Wednesday", "Yes"),
        ("Wednesday", "Yes"),
        ("Wednesday", "Yes"),
        ("Wednesday", "Yes"),
        ("Wednesday", "No"),
        ("Thursday", "Yes"),
        ("Thursday", "Yes"),
        ("Thursday", "No"),
        ("Thursday", "No"),
        ("Friday", "No"),
        ("Friday", "No"),
        ("Friday", "Yes"),
        ("Saturday", "Yes"),
        ("Saturday", "Yes"),
        ("Saturday", "No"),
    ],
    ["Day", "Sunny"]
)
df.groupBy(["Day", "Sunny"]).count().sort("Day").show()
You need each day's total (e.g. "count of (Fri, Yes)" + "count of (Fri, No)") to compute the percentage, so take a sum over a window partitioned by Day and divide by it:
from pyspark.sql import Window
from pyspark.sql import functions as F

# df here is the grouped-and-counted dataframe from above
w = Window.partitionBy('Day')
df = df.withColumn('percentage', F.col('count') / F.sum(F.col('count')).over(w) * 100)
If you only want whole numbers, use floor (or round) to drop the fractional part:
df = df.withColumn('percentage', F.floor(F.col('count') / F.sum(F.col('count')).over(w) * 100))