Can we replace outliers with predicted values in PySpark?

I have a df in Spark (this is the dataset I am actually working with; I can't paste the whole thing, so here is a link):

df = https://www.kaggle.com/schirmerchad/bostonhoustingmlnd?select=housing.csv

I found the outliers as follows (22 rows in total):

    def IQR(df, column):
        # relativeError = 0 asks approxQuantile for exact quartiles
        quantiles = df.approxQuantile(column, [0.25, 0.75], 0)
        q1 = quantiles[0]
        q3 = quantiles[1]
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        return (lower, upper)

    lower, upper = IQR(df, 'RM')
    # lower, upper = 4.8374999999999995, 7.617500000000001
    outliers = df.filter((df['RM'] > upper) | (df['RM'] < lower))
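As a sanity check on the fences reported above, the arithmetic can be reproduced without Spark. This is only a sketch: the quartiles `q1 = 5.88` and `q3 = 6.575` are assumptions back-computed from the printed `lower` and `upper` values, not read from the dataset, and `iqr_fences` and `is_outlier` are hypothetical helper names.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Assumed quartiles, back-computed from the fences printed in the question.
q1, q3 = 5.88, 6.575
lower, upper = iqr_fences(q1, q3)


def is_outlier(rm):
    """True when an RM value falls outside the fences."""
    return rm < lower or rm > upper


# 8.069 and 3.561 are RM values from the question's outlier table;
# 6.2 is a made-up value inside the fences, used only for contrast.
flags = {rm: is_outlier(rm) for rm in (8.069, 3.561, 6.2)}
print(round(lower, 4), round(upper, 4))
print(flags)
```

The same comparison is what the `df.filter(...)` call applies row by row to the `RM` column.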

These are the detected outliers:

RM     LSTAT  PTRATIO  MEDV
8.069  4.21   18       812700
7.82   3.57   18       919800
7.765  7.56   17.8     835800
7.853  3.81   14.7     1018500
8.266  4.14   17.4     940800
8.04   3.13   17.4     789600
7.686  3.92   17.4     980700
8.337  2.47   17.4     875700
8.247  3.95   17.4     1014300
8.259  3.54   19.1     898800
8.398  5.91   13       1024800
7.691  6.58   18.6     739200
7.82   3.76   14.9     953400
7.645  3.01   14.9     966000
3.561  7.12   20.2     577500
3.863  13.33  20.2     485100
4.138  37.97  20.2     289800
4.368  30.63  20.2     184800
4.652  28.28  20.2     220500
4.138  23.34  20.2     249900
4.628  34.37  20.2     375900
4.519  36.98  20.2     147000

Now I want to replace the outliers with ML-predicted values. After the ML step I get the following predictions:

  RM    LSTAT   PTRATIO MEDV    column_assem        column          prediction
8.069   4.21    18  812700  {"vectorType":"dense","length":3,"values":[4.21,18,812700]} {"vectorType":"dense","length":3,"values":[812699.9991344779,32.9872628621034,25.697942748362507]}  7.138307692307692
7.82    3.57    18  919800  {"vectorType":"dense","length":3,"values":[3.57,18,919800]} {"vectorType":"dense","length":3,"values":[919799.999082192,36.25675952004636,26.656936598060938]}  7.138307692307692
7.765   7.56    17.8    835800  {"vectorType":"dense","length":3,"values":[7.56,17.8,835800]}   {"vectorType":"dense","length":3,"values":[835799.9989959698,37.18609141885786,25.87518521779868]}  7.138307692307692
7.853   3.81    14.7    1018500 {"vectorType":"dense","length":3,"values":[3.81,14.7,1018500]}  {"vectorType":"dense","length":3,"values":[1018499.9990279829,40.25963007114179,24.285126110831364]}    7.138307692307692
8.266   4.14    17.4    940800  {"vectorType":"dense","length":3,"values":[4.14,17.4,940800]}   {"vectorType":"dense","length":3,"values":[940799.9990507461,37.621770135316275,26.279618209844216]}    7.138307692307692
8.04    3.13    17.4    789600  {"vectorType":"dense","length":3,"values":[3.13,17.4,789600]}   {"vectorType":"dense","length":3,"values":[789599.999195178,31.094759131505864,24.832393813608636]} 7.138307692307692
7.686   3.92    17.4    980700  {"vectorType":"dense","length":3,"values":[3.92,17.4,980700]}   {"vectorType":"dense","length":3,"values":[980699.9990305867,38.858227336579965,26.637789595102927]}    7.138307692307692
8.337   2.47    17.4    875700  {"vectorType":"dense","length":3,"values":[2.47,17.4,875700]}   {"vectorType":"dense","length":3,"values":[875699.9991585133,33.577861049146954,25.59625197564997]} 7.138307692307692
8.247   3.95    17.4    1014300 {"vectorType":"dense","length":3,"values":[3.95,17.4,1014300]}  {"vectorType":"dense","length":3,"values":[1014299.9990056665,40.11446130241714,26.949909126197]}   7.138307692307692
8.259   3.54    19.1    898800  {"vectorType":"dense","length":3,"values":[3.54,19.1,898800]}   {"vectorType":"dense","length":3,"values":[898799.9990899825,35.406713649671325,27.56000332051734]} 7.138307692307692
8.398   5.91    13  1024800 {"vectorType":"dense","length":3,"values":[5.91,13,1024800]}    {"vectorType":"dense","length":3,"values":[1024799.9989586923,42.669988999612016,22.74784587477886]}    7.138307692307692
7.691   6.58    18.6    739200  {"vectorType":"dense","length":3,"values":[6.58,18.6,739200]}   {"vectorType":"dense","length":3,"values":[739199.9990946348,32.64270527156902,25.73328780757773]}  7.138307692307692
7.82    3.76    14.9    953400  {"vectorType":"dense","length":3,"values":[3.76,14.9,953400]}   {"vectorType":"dense","length":3,"values":[953399.9990744753,37.82403517229104,23.880552758747136]} 7.138307692307692
7.645   3.01    14.9    966000  {"vectorType":"dense","length":3,"values":[3.01,14.9,966000]}   {"vectorType":"dense","length":3,"values":[965999.9990932231,37.53477931241747,23.960460322415766]} 7.138307692307692
3.561   7.12    20.2    577500  {"vectorType":"dense","length":3,"values":[7.12,20.2,577500]}   {"vectorType":"dense","length":3,"values":[577499.9991773808,27.20258411502299,25.862694427868608]} 6.376732394366198
3.863   13.33   20.2    485100  {"vectorType":"dense","length":3,"values":[13.33,20.2,485100]}  {"vectorType":"dense","length":3,"values":[485099.999013695,30.032948373359417,25.311342678468208]} 6.043858108108108
4.138   37.97   20.2    289800  {"vectorType":"dense","length":3,"values":[37.97,20.2,289800]}  {"vectorType":"dense","length":3,"values":[289799.99824280146,47.51591753902686,24.707706732637366]}    5.2370714285714275
4.368   30.63   20.2    184800  {"vectorType":"dense","length":3,"values":[30.63,20.2,184800]}  {"vectorType":"dense","length":3,"values":[184799.99858809082,36.35256433967503,23.378827944979733]}    5.2370714285714275
4.652   28.28   20.2    220500  {"vectorType":"dense","length":3,"values":[28.28,20.2,220500]}  {"vectorType":"dense","length":3,"values":[220499.9986495131,35.3082739723793,23.59425617851294]}   5.2370714285714275
4.138   23.34   20.2    249900  {"vectorType":"dense","length":3,"values":[23.34,20.2,249900]}  {"vectorType":"dense","length":3,"values":[249899.99881098093,31.44714189260281,23.625084354536643]}    6.043858108108108
4.628   34.37   20.2    375900  {"vectorType":"dense","length":3,"values":[34.37,20.2,375900]}  {"vectorType":"dense","length":3,"values":[375899.9983146336,47.06252004732307,25.328138233469573]} 5.2370714285714275
4.519   36.98   20.2    147000  {"vectorType":"dense","length":3,"values":[36.98,20.2,147000]}  {"vectorType":"dense","length":3,"values":[146999.99838054206,41.31545014321207,23.33912202640834]} 5.2370714285714275

If it were a single value, I know lit() could replace it, but how do we replace the original values when there are multiple values?

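Assuming the original DataFrame is called df and the ML-transformed DataFrame is called ml, you can join the two and replace the RM column with the predicted value wherever the row satisfies the outlier condition: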

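from pyspark.sql import functions as F

# Left join keeps every row of df; rows without a match in ml get a null prediction
df2 = df.join(ml, df.columns, 'left').withColumn(
    'RM',
    F.when(
        (F.col('RM') > upper) | (F.col('RM') < lower),
        F.col('prediction')
    ).otherwise(F.col('RM'))
).select(df.columns)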