Calculating a percentage and assigning it to a new column in the same dataframe
I have a Spark dataframe like the one below:
+-------+----------+-----+
| Status|      date|count|
+-------+----------+-----+
|Success|2019-09-06|23596|
|Failure|2019-09-06| 2494|
|Failure|2019-09-07| 1863|
|Success|2019-09-07|22399|
+-------+----------+-----+
I am trying to calculate the success/failure percentage per date and add the result to the same PySpark dataframe. I can compute the rate for each group by creating several intermediate tables/dataframes. How can I do this with the same single dataframe, without creating new intermediate ones?
Expected output:
+-------+----------+-----+---------------------------+
| Status|      date|count|                    Percent|
+-------+----------+-----+---------------------------+
|Success|2019-09-06|23596| =(23596/(23596+2494)*100) |
|Failure|2019-09-06| 2494| =(2494/(23596+2494)*100)  |
|Failure|2019-09-07| 1863| =(1863/(1863+22399)*100)  |
|Success|2019-09-07|22399| =(22399/(1863+22399)*100) |
+-------+----------+-----+---------------------------+
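For reproducibility, here is a minimal sketch that builds this sample dataframe (the SparkSession variable spark and the string-typed date column are assumptions, not part of the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above; dates kept as plain strings
data = [('Success', '2019-09-06', 23596),
        ('Failure', '2019-09-06', 2494),
        ('Failure', '2019-09-07', 1863),
        ('Success', '2019-09-07', 22399)]
df = spark.createDataFrame(data, ['Status', 'date', 'count'])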
You can use a window partitioned by the 'date' column, so that rows with the same date are grouped together, and then take the sum of the 'count' column over that window:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Group rows that share the same date
window = Window.partitionBy('date')

# Divide each row's count by the per-date total and scale to a percentage
df = df.withColumn('Percent', F.col('count') / F.sum('count').over(window) * 100)
df.show()
+-------+-------------------+-----+-----------------+
| Status|               date|count|          Percent|
+-------+-------------------+-----+-----------------+
|Failure|2019-09-07 00:00:00| 1863|7.678674470365181|
|Success|2019-09-07 00:00:00|22399|92.32132552963482|
|Success|2019-09-06 00:00:00|23596|90.44078190877731|
|Failure|2019-09-06 00:00:00| 2494|9.559218091222691|
+-------+-------------------+-----+-----------------+
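If the long decimals are distracting, the same window expression can be wrapped in F.round; two decimal places here is just an example choice, not something from the original answer:

# Same window expression, rounded to two decimal places
df = df.withColumn('Percent',
                   F.round(F.col('count') / F.sum('count').over(window) * 100, 2))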