The calculated RobustScaler in sklearn seems not right

Question

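I tried the RobustScaler in sklearn, and the result differs from what the formula gives.

The formula for RobustScaler in sklearn is:

I have a matrix like the one shown below:

I tested the first value of feature one (first row, first column). The scaled value should be (1-3)/(5.5-1.5) = -0.5. However, sklearn gives -0.67. Does anyone know where the calculation goes wrong?

The code using sklearn is as follows: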

import numpy as np
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)

Answer 1

From the RobustScaler documentation (emphasis added):

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

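That is, the median and the IQR are computed per column (feature), not over the whole array.

With that cleared up, let's compute the scaled values for your first column by hand: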
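As a minimal sketch of what "per column" means in NumPy terms, the statistics can be taken along `axis=0`. The variable names `col_median` and `col_iqr` are illustrative only, and this assumes NumPy's default (linear) percentile interpolation:

```python
import numpy as np

x = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [2, 1, 1, 1]])

# axis=0 reduces over the rows (samples), leaving one statistic per column (feature)
col_median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
col_iqr = q75 - q25

print(col_median)  # medians per column: 3, 3.5, 4.5, 5.5
print(col_iqr)     # IQRs per column: 3, 4, 4.25, 4.5
```

Each column therefore gets its own center and its own scale, which is why a single median or IQR taken over all 16 values would not reproduce the scaler's output.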

import numpy as np

x1 = np.array([1, 4, 7, 2]) # your 1st column here

q75, q25 = np.percentile(x1, [75 ,25])
iqr = q75 - q25

x1_med = np.median(x1)

x1_scaled = (x1-x1_med)/iqr
x1_scaled
# array([-0.66666667,  0.33333333,  1.33333333, -0.33333333])

This is identical to the first column of your own x_new, as computed by scikit-learn:

# your code verbatim:
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
# result
[[-0.66666667 -0.375      -0.35294118 -0.33333333]
 [ 0.33333333  0.375       0.35294118  0.33333333]
 [ 1.33333333  1.125       1.05882353  1.        ]
 [-0.33333333 -0.625      -0.82352941 -1.        ]]

np.all(x1_scaled == x_new[:,0])
# True

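The same holds for the remaining columns (features): you need to compute the median and IQR of each column separately before scaling it.

Update (after comments):

As pointed out in the Wikipedia entry on quartiles: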
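The whole matrix can be checked in one go. The following is a sketch rather than a description of scikit-learn's internals: it applies the per-column median and IQR to every feature at once and compares the result with `RobustScaler`. The names `x_manual` and `x_sklearn` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [2, 1, 1, 1]], dtype=float)

# Per-column statistics; broadcasting applies them row by row
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
x_manual = (x - median) / (q75 - q25)

x_sklearn = RobustScaler(quantile_range=(25.0, 75.0), with_centering=True).fit_transform(x)

# allclose tolerates tiny floating-point differences that == would not
print(np.allclose(x_manual, x_sklearn))  # True
```

`np.allclose` is used here instead of `==` because exact equality between two floating-point computations is not guaranteed in general, even though it happens to hold for the first column above.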

For discrete distributions, there is no universal agreement on selecting the quartile values

See also the relevant reference, Sample quantiles in statistical packages:

There are a large number of different definitions used for sample quantiles in statistical computer packages

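Digging into the documentation of np.percentile, which is used here, you will find no fewer than five (5) different interpolation methods, and not all of them give the same result (see also the 4 different methods demonstrated in the Wikipedia entry linked above). Here is a quick demonstration of these methods and their results on the x1 data defined above: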

# Note: NumPy >= 1.22 renames the `interpolation` keyword to `method`
np.percentile(x1, [75 ,25]) # interpolation='linear' by default
# array([4.75, 1.75])

np.percentile(x1, [75 ,25], interpolation='lower')
# array([4, 1])

np.percentile(x1, [75 ,25], interpolation='higher')
# array([7, 2])

np.percentile(x1, [75 ,25], interpolation='midpoint')
# array([5.5, 1.5])

np.percentile(x1, [75 ,25], interpolation='nearest')
# array([4, 2])

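Apart from the fact that no two methods give the same result, the definition you used in your own calculation corresponds to interpolation='midpoint', while the default NumPy method is interpolation='linear'. As Ben Reiniger correctly points out in a comment below, what the source code of RobustScaler actually uses is np.nanpercentile (a variant of the np.percentile I use here that can handle nan values), with its default interpolation='linear' setting.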
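To tie this back to the question, here is a small sketch that scales the first value of the first column under both definitions. The helper `percentile` is hypothetical, not part of NumPy. It exists only because the keyword is called `method` in NumPy >= 1.22 and `interpolation` in older releases.

```python
import numpy as np

def percentile(a, q, how):
    """Hypothetical helper: use whichever keyword this NumPy version accepts."""
    try:
        return np.percentile(a, q, method=how)         # NumPy >= 1.22
    except TypeError:
        return np.percentile(a, q, interpolation=how)  # older NumPy

x1 = np.array([1, 4, 7, 2])  # first column of the question's matrix
med = np.median(x1)

results = {}
for how in ("midpoint", "linear"):
    q25, q75 = percentile(x1, [25, 75], how)
    results[how] = (x1[0] - med) / (q75 - q25)
    print(how, results[how])

# midpoint gives -0.5, the hand calculation in the question
# linear gives about -0.667, the value RobustScaler returns
```

Both numbers are legitimate under their respective quartile definitions; the discrepancy comes purely from which convention is used.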

python

scikit-learn

data-preprocessing