Replace outliers with the median, except NaN

I want to replace the outliers in my DataFrame with the median, but only the outliers, not the NaN values.

First, my data:

      January  February 
0      -5.0     -7.0 
1      -6.0     -6.0 
2      -5.0     -5.0  
3      -3.0     -6.0 
4      -6.0     -8.0   
5     -11.0     -9.0    
6      -6.0      5.0    
7      -8.0    -11.0  
8     -11.0    -12.0  
9      -8.0     -9.0     
10     -8.0     -6.0   
11     -8.0     -5.0    
12     -8.0     -4.0   
13    -10.0      1.0    
14    -10.0      3.0   
15     -9.0     -9.0    
16     -6.0     -6.0   
17     -6.0     -6.0   
18     -4.0     -4.0  
19     -8.0      2.0    
20     -9.0      3.0      
21    -14.0      1.0 
22    -15.0     -3.0  
23    -17.0     -4.0   
24    -19.0     -6.0     
25    -60.0     -8.0   
26     -8.0     -8.0   
27     -9.0    -11.0    
28     -5.0      NaN    
29     -6.0      NaN    
30     -7.0      NaN 

I want to replace the outlier -60 with the median:

df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 4).all(axis=1)]

It works, but it also drops every row that contains NaN. How can I avoid that?

Output:

      January  February
0      -5.0     -7.0 
1      -6.0     -6.0 
2      -5.0     -5.0  
3      -3.0     -6.0 
4      -6.0     -8.0   
5     -11.0     -9.0    
6      -6.0      5.0    
7      -8.0    -11.0  
8     -11.0    -12.0  
9      -8.0     -9.0     
10     -8.0     -6.0   
11     -8.0     -5.0    
12     -8.0     -4.0   
13    -10.0      1.0    
14    -10.0      3.0   
15     -9.0     -9.0    
16     -6.0     -6.0   
17     -6.0     -6.0   
18     -4.0     -4.0  
19     -8.0      2.0    
20     -9.0      3.0      
21    -14.0      1.0 
22    -15.0     -3.0  
23    -17.0     -4.0   
24    -19.0     -6.0     
25    -10.0     -8.0   
26     -8.0     -8.0   
27     -9.0    -11.0

As you can see, 3 rows were deleted, which is not what I want. Any ideas? Thanks!

You can add .isna() to your condition:

df = df[df.apply(lambda x: (np.abs(x - x.mean()) / x.std() < 4) | x.isna()).all(axis=1)]
print(df)

Prints (notice that index 25, the -60.0 row, is missing):

      January  February
0        -5.0      -7.0
1        -6.0      -6.0
2        -5.0      -5.0
3        -3.0      -6.0
4        -6.0      -8.0
5       -11.0      -9.0
6        -6.0       5.0
7        -8.0     -11.0
8       -11.0     -12.0
9        -8.0      -9.0
10       -8.0      -6.0
11       -8.0      -5.0
12       -8.0      -4.0
13      -10.0       1.0
14      -10.0       3.0
15       -9.0      -9.0
16       -6.0      -6.0
17       -6.0      -6.0
18       -4.0      -4.0
19       -8.0       2.0
20       -9.0       3.0
21      -14.0       1.0
22      -15.0      -3.0
23      -17.0      -4.0
24      -19.0      -6.0
26       -8.0      -8.0
27       -9.0     -11.0
28       -5.0       NaN
29       -6.0       NaN
30       -7.0       NaN
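
The reason the original filter dropped the NaN rows is that any comparison with NaN evaluates to False, so those rows fail the `< 4` test. A minimal sketch of that behaviour on a small made-up Series (the names `s`, `keep_plain` and `keep_fixed` are only for illustration):

```python
import numpy as np
import pandas as pd

# Small hypothetical column: one NaN among ordinary values.
s = pd.Series([-5.0, -6.0, -7.0, np.nan])

# mean() and std() skip NaN by default; z is NaN wherever s is NaN.
z = (s - s.mean()).abs() / s.std()

keep_plain = z < 4               # NaN < 4 is False, so the NaN entry is rejected
keep_fixed = (z < 4) | s.isna()  # explicitly keep the missing values

print(keep_plain.tolist())  # [True, True, True, False]
print(keep_fixed.tolist())  # [True, True, True, True]
```

Note that this condition still filters rows: the outlier row is removed rather than replaced with the median.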

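Using numpy.where(...):

import numpy as np

# z-score of every cell; NaN cells give NaN, and NaN >= 4 is False, so they are kept
z = df.sub(df.mean(axis=0)).abs().div(df.std(axis=0))
# replace cells with |z| >= 4 by their column median, keep everything else
df[["January", "February"]] = np.where(z >= 4, df.median(axis=0), df)

Output: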

    January  February
0      -5.0      -7.0
1      -6.0      -6.0
2      -5.0      -5.0
3      -3.0      -6.0
4      -6.0      -8.0
5     -11.0      -9.0
6      -6.0       5.0
7      -8.0     -11.0
8     -11.0     -12.0
9      -8.0      -9.0
10     -8.0      -6.0
11     -8.0      -5.0
12     -8.0      -4.0
13    -10.0       1.0
14    -10.0       3.0
15     -9.0      -9.0
16     -6.0      -6.0
17     -6.0      -6.0
18     -4.0      -4.0
19     -8.0       2.0
20     -9.0       3.0
21    -14.0       1.0
22    -15.0      -3.0
23    -17.0      -4.0
24    -19.0      -6.0
25     -8.0      -8.0
26     -8.0      -8.0
27     -9.0     -11.0
28     -5.0       NaN
29     -6.0       NaN
30     -7.0       NaN
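
The same replacement can probably also be written without NumPy, using `DataFrame.mask`. This is only a sketch: the helper name `replace_outliers_with_median` and its `threshold` parameter are hypothetical, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "January": [-5, -6, -5, -3, -6, -11, -6, -8, -11, -8, -8, -8, -8, -10, -10, -9,
                -6, -6, -4, -8, -9, -14, -15, -17, -19, -60, -8, -9, -5, -6, -7],
    "February": [-7, -6, -5, -6, -8, -9, 5, -11, -12, -9, -6, -5, -4, 1, 3, -9,
                 -6, -6, -4, 2, 3, 1, -3, -4, -6, -8, -8, -11, nan, nan, nan],
}, dtype=float)

def replace_outliers_with_median(frame, threshold=4.0):
    """Return a copy where cells with |z-score| >= threshold become the column median."""
    z = (frame - frame.mean()).abs() / frame.std()
    # z is NaN for missing cells and NaN >= threshold is False, so NaN stays NaN.
    # axis=1 aligns the Series of medians with the column labels.
    return frame.mask(z >= threshold, frame.median(), axis=1)

out = replace_outliers_with_median(df)
print(out.loc[25, "January"])        # -8.0
print(out["February"].isna().sum())  # 3
```

Unlike the in-place assignment above, this returns a new DataFrame and leaves `df` unchanged, which makes it easier to compare the two.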