PySpark: how to compute the percentage of rows matching a condition in a DataFrame

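How can I compute, for each address, the percentage of rows whose performance P falls into each of these ranges: P <= 5; P > 5 & P <= 15; P > 15?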

address performance = P
NACELLES 589
NACELLES 0
NACELLES 48
NACELLES 318
NACELLES 378
NACELLES 52
NACELLES 45
NACELLES 201
NACELLES 416
NACELLES 29
NACELLES 183
NACELLES 53
NACELLES 7
NACELLES 127
NACELLES 157
NACELLES 248
NACELLES 10
NACELLES 317
NACELLES 2
NACELLES 4

The expected result is this dataset:

address P<=5 P>5 & P<=15 P> 15
NACELLES 15 % 10 % 75 %

Taking your DataFrame as an example:

+--------+-----------+
| address|performance|
+--------+-----------+
|NACELLES|        589|
|NACELLES|          0|
|NACELLES|         48|
|NACELLES|        318|

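You can simply group by address and sum over a when expression for each range, then divide by the number of rows in the group: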

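from pyspark.sql import functions as F

# Per address: count the rows in each range, then divide by the group's row count.
# otherwise(0) keeps a range with no matching rows at 0.0 instead of null.
df.groupBy("address").agg(
    (F.sum(F.when(F.col("performance") <= 5, 1).otherwise(0)) / F.count("*")).alias("P<=5"),
    (
        F.sum(F.when((F.col("performance") > 5) & (F.col("performance") <= 15), 1).otherwise(0))
        / F.count("*")
    ).alias("P>5 & P<=15"),
    (F.sum(F.when(F.col("performance") > 15, 1).otherwise(0)) / F.count("*")).alias("P>15"),
).show()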

+--------+----+-----------+----+
| address|P<=5|P>5 & P<=15|P>15|
+--------+----+-----------+----+
|NACELLES|0.15|        0.1|0.75|
+--------+----+-----------+----+