How to get the mean in PySpark?
I have a Spark DataFrame:
df = spark.createDataFrame([(10, "Hyundai"), (20, "alpha") ,(70,'Audio'), (1000,'benz'), (50,'Suzuki'),(60,'Lambo'),(30,'Bmw')],["Cars", "Brand"])
Now I want to find the outliers. To do that I used the IQR to get the lower and upper bounds shown below, and then filtered for the outliers:
lower, upper = -55.0, 145.0
outliers = df.filter((df['Cars'] > upper) | (df['Cars'] < lower))
Cars Brand
1000 benz
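The question does not show how the bounds were computed. As a rough sanity check, assuming they came from the usual 1.5 × IQR rule, the same numbers can be reproduced in plain Python without Spark. The variable names here are illustrative, and the quartile method (the default "exclusive" method of statistics.quantiles) is an assumption; other quartile definitions give different bounds.

```python
import statistics

# Same values as the "Cars" column in the question
cars = [10, 20, 70, 1000, 50, 60, 30]

# Quartiles using the default "exclusive" method (an assumption;
# the question does not say which quartile definition was used)
q1, _median, q3 = statistics.quantiles(cars, n=4)
iqr = q3 - q1

# Usual 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the fences and average them
inliers = [c for c in cars if lower <= c <= upper]
mean_without_outliers = statistics.fmean(inliers)

print(lower, upper)            # -55.0 145.0
print(mean_without_outliers)   # 40.0
```

With these fences only 1000 falls outside the range, which matches the single outlier row shown above.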
Now I want to find the mean excluding the outliers. For that I used the functions mean and when, but I got an error like this:
"TypeError: 'Column' object is not callable"
from pyspark.sql import functions as fun
mean = df.select(fun.when((df['Cars'] > upper) | (df['Cars'] < lower), fun.mean(df['Cars'].alias('mean')).collect()[0]['mean']))
print(mean)
Is my code wrong, or is there a better way to do this?
I don't think you need to use when() here. You can filter the rows and then aggregate the mean:
import pyspark.sql.functions as F
mean = df.filter((df['Cars'] <= upper) & (df['Cars'] >= lower)).agg(F.mean('Cars').alias('mean'))
mean.show()
+----+
|mean|
+----+
|40.0|
+----+
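If you do want to use when(), you can use conditional aggregation. Without an otherwise() clause, when() returns null for rows that fail the condition, and mean ignores nulls: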
mean = df.agg(F.mean(F.when((df['Cars'] <= upper) & (df['Cars'] >= lower), df['Cars'])).alias('mean'))
mean.show()
+----+
|mean|
+----+
|40.0|
+----+
To collect the result into a variable, you can use collect:
mean_collected = mean.collect()[0][0]