How to filter dataframe using percentiles to filter out outliers?

Suppose I have a Spark dataframe like this:

+------------+-----------+
|category    |value      |
+------------+-----------+
|           a|          1|
|           a|          2|
|           b|          2|
|           a|          3|
|           b|          4|
|           a|          4|
|           b|          6|
|           b|          8|
+------------+-----------+

For each category, I would like to set the values above the 0.75 percentile to nan, that being:

a_values = [1,2,3,4] => a_values_filtered = [1,2,3,nan]
b_values = [2,4,6,8] => b_values_filtered = [2,4,6,nan]

So the expected output is:

+------------+-----------+
|category    |value      |
+------------+-----------+
|           a|          1|
|           a|          2|
|           b|          2|
|           a|          3|
|           b|          4|
|           a|        nan|
|           b|          6|
|           b|        nan|
+------------+-----------+

Any idea how to do this cleanly?

PS: I am new to Spark.
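
For reproducibility, a minimal sketch of how the example dataframe above could be built (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table in the question
df = spark.createDataFrame(
    [('a', 1), ('a', 2), ('b', 2), ('a', 3),
     ('b', 4), ('a', 4), ('b', 6), ('b', 8)],
    ['category', 'value'])
df.show()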

Use the percent_rank function to get the percentile, then use when to assign null to values with a percent_rank above 0.75:

from pyspark.sql import Window
from pyspark.sql.functions import percent_rank, when

# Rank each value within its category as a fraction between 0 and 1
w = Window.partitionBy(df.category).orderBy(df.value)
percentiles_df = df.withColumn('percentile', percent_rank().over(w))

# Keep the value when its rank is at most 0.75, otherwise null
result = percentiles_df.select(
    percentiles_df.category,
    when(percentiles_df.percentile <= 0.75, percentiles_df.value).alias('value'))
result.show()
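
Note that when() without an otherwise() clause fills the filtered rows with null rather than nan. If you specifically want the nan shown in the question, one variant (a sketch, assuming value is cast to double so the column can hold NaN) would be:

from pyspark.sql.functions import lit

result = percentiles_df.select(
    percentiles_df.category,
    when(percentiles_df.percentile <= 0.75, percentiles_df.value.cast('double'))
        .otherwise(lit(float('nan')))
        .alias('value'))
result.show()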

Here is another snippet, similar to Prabhala's answer; I use the percentile_approx SQL function instead.

from pyspark.sql import Window
import pyspark.sql.functions as F

# 0.75 quantile computed over all rows in each category
window = Window.partitionBy('category')
percentile = F.expr('percentile_approx(value, 0.75)')
tmp_df = df.withColumn('percentile_value', percentile.over(window))

# Keep values at or below the per-category quantile, null otherwise
result = tmp_df.select(
    'category',
    F.when(tmp_df.percentile_value >= tmp_df.value, tmp_df.value).alias('value'))
result.show()

+--------+-----+
|category|value|
+--------+-----+
|       b|    2|
|       b|    4|
|       b|    6|
|       b| null|
|       a|    1|
|       a|    2|
|       a|    3|
|       a| null|
+--------+-----+
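
For reference, the same result can also be obtained without a window function, by aggregating the quantile per category and joining it back (a sketch, not from either answer above; the threshold column name is only for illustration):

import pyspark.sql.functions as F

# 0.75 quantile per category, computed once with a groupBy aggregation
thresholds = df.groupBy('category').agg(
    F.expr('percentile_approx(value, 0.75)').alias('threshold'))

# Join the threshold back and null out values above it
result = (df.join(thresholds, on='category')
            .withColumn('value',
                        F.when(F.col('value') <= F.col('threshold'), F.col('value')))
            .drop('threshold'))
result.show()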