Calculate percentile on pyspark dataframe columns
I have a PySpark dataframe that contains an ID and then several variables for which I want to calculate the 95th percentile.
Part of the printSchema() output:
root
|-- ID: string (nullable = true)
|-- MOU_G_EDUCATION_ADULT: double (nullable = false)
|-- MOU_G_EDUCATION_KIDS: double (nullable = false)
I found How to derive Percentile using Spark Data frame and GroupBy in python, but it fails with an error message:
perc95_udf = udf(lambda x: x.quantile(.95))
fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", perc95_udf('MOU_G_EDUCATION_ADULT')) \
.withColumn("P95_MOU_G_EDUCATION_KIDS", perc95_udf('MOU_G_EDUCATION_KIDS'))
fanscores.take(2)
AttributeError: 'float' object has no attribute 'quantile'
Other UDF attempts I have tried:
def percentile(quantiel,kolom):
    x=np.array(kolom)
    perc=np.percentile(x, quantiel)
    return perc
percentile_udf = udf(percentile, LongType())
fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", percentile_udf(quantiel=95, kolom=genres.MOU_G_EDUCATION_ADULT)) \
.withColumn("P95_MOU_G_EDUCATION_KIDS", percentile_udf(quantiel=95, kolom=genres.MOU_G_EDUCATION_KIDS))
fanscores.take(2)
This gives the error: "TypeError: wrapper() got an unexpected keyword argument 'quantiel'"
My final attempt:
import numpy as np
def percentile(quantiel):
    return udf(lambda kolom: np.percentile(np.array(kolom), quantiel))
fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", percentile(quantiel=95)(genres.MOU_G_EDUCATION_ADULT)) \
.withColumn("P95_MOU_G_EDUCATION_KIDS", percentile(quantiel=95) (genres.MOU_G_EDUCATION_KIDS))
fanscores.take(2)
This gives the error:
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
How can I solve this?
df.selectExpr('percentile(MOU_G_EDUCATION_ADULT, 0.95)').show()
For large datasets, consider using percentile_approx() instead, which computes an approximate result.
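To cover both columns from the question in one pass, the same SQL expressions can be generated per column. The following is only a sketch: percentile_exprs and demo are hypothetical helper names that do not come from the original post, the sample rows are made up for illustration, and the P95_ alias simply mirrors the column names used in the question.

```python
def percentile_exprs(columns, q=0.95, approx=False):
    """Build one Spark SQL expression string per column.

    Hypothetical helper: uses the exact `percentile` SQL function by
    default, or `percentile_approx` when approx=True.
    """
    func = "percentile_approx" if approx else "percentile"
    # Alias such as P95_<column>, following the naming in the question.
    return [f"{func}({c}, {q}) AS P{round(q * 100)}_{c}" for c in columns]


def demo():
    """Run the expressions on a tiny made-up dataframe (needs PySpark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("p95").getOrCreate()
    # Illustrative rows only; the real `genres` dataframe is not shown in the post.
    df = spark.createDataFrame(
        [("a", 1.0, 10.0), ("b", 2.0, 20.0), ("c", 3.0, 30.0)],
        ["ID", "MOU_G_EDUCATION_ADULT", "MOU_G_EDUCATION_KIDS"],
    )
    cols = ["MOU_G_EDUCATION_ADULT", "MOU_G_EDUCATION_KIDS"]
    # One aggregated row with a 95th percentile per column.
    df.selectExpr(*percentile_exprs(cols)).show()
    # Approximate variant, which may be preferable on large data.
    df.selectExpr(*percentile_exprs(cols, approx=True)).show()
    spark.stop()
```

Calling demo() in an environment with PySpark installed prints one aggregated row per variant. Note that this returns a single summary row instead of adding a column to every row, which is also why the row-wise UDF attempts above cannot work: a UDF receives one value at a time, not the whole column.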