PySpark: how to compute the percentage with a condition in a DataFrame
How can I compute the share of rows whose performance P falls into each of these ranges: P <= 5; P > 5 & P <= 15; P > 15?
address | performance = P |
---|---|
NACELLES | 589 |
NACELLES | 0 |
NACELLES | 48 |
NACELLES | 318 |
NACELLES | 378 |
NACELLES | 52 |
NACELLES | 45 |
NACELLES | 201 |
NACELLES | 416 |
NACELLES | 29 |
NACELLES | 183 |
NACELLES | 53 |
NACELLES | 7 |
NACELLES | 127 |
NACELLES | 157 |
NACELLES | 248 |
NACELLES | 10 |
NACELLES | 317 |
NACELLES | 2 |
NACELLES | 4 |
The expected result is this dataset:
address | P<=5 | P>5 & P<=15 | P>15 |
---|---|---|---|
NACELLES | 15 % | 10 % | 75 % |
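As a sanity check on the expected figures, the bucket counts can be reproduced in plain Python without Spark. This is only a sketch: the helper name `bucket_shares` is hypothetical, and the list is the 20 performance values from the question.

```python
# The 20 performance values listed in the question (all for address NACELLES).
values = [589, 0, 48, 318, 378, 52, 45, 201, 416, 29,
          183, 53, 7, 127, 157, 248, 10, 317, 2, 4]


def bucket_shares(vals):
    """Return the percentage of values in each of the three ranges."""
    n = len(vals)
    low = sum(1 for v in vals if v <= 5)        # P <= 5
    mid = sum(1 for v in vals if 5 < v <= 15)   # P > 5 and P <= 15
    high = sum(1 for v in vals if v > 15)       # P > 15
    return {
        "P<=5": 100 * low / n,
        "P>5 & P<=15": 100 * mid / n,
        "P>15": 100 * high / n,
    }


shares = bucket_shares(values)
print(shares)
```

Three values (0, 2, 4) fall in the first range, two (7, 10) in the second, and the remaining fifteen in the third, which is where 15 %, 10 % and 75 % come from.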
Taking your dataframe as an example:
+--------+-----------+
| address|performance|
+--------+-----------+
|NACELLES|        589|
|NACELLES|          0|
|NACELLES|         48|
|NACELLES|        318|
You only need to aggregate, summing over the `when` function and dividing by the row count:
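from pyspark.sql import functions as F

# For each address, count the rows in each range and divide by the total row count.
df.groupBy("address").agg(
    # F.when without otherwise yields null for non-matching rows, which F.sum skips.
    (F.sum(F.when(F.col("performance") <= 5, 1)) / F.count("*")).alias("P<=5"),
    (
        F.sum(F.when((F.col("performance") > 5) & (F.col("performance") <= 15), 1))
        / F.count("*")
    ).alias("P>5 & P<=15"),
    (F.sum(F.when(F.col("performance") > 15, 1)) / F.count("*")).alias("P>15"),
).show()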
+--------+----+-----------+----+
| address|P<=5|P>5 & P<=15|P>15|
+--------+----+-----------+----+
|NACELLES|0.15|        0.1|0.75|
+--------+----+-----------+----+