Why do StandardScaler and Normalizer need different data input?

Question

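I tried the following code and found that StandardScaler (or MinMaxScaler) and Normalizer from sklearn handle data very differently. This makes building a pipeline more difficult. I would like to know whether this design difference is intentional.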

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

Normalizer reads the data "horizontally":

Normalizer(norm = 'max').fit_transform([[ 1., 1.,  2., 10],
                                        [ 2.,  0.,  0., 100],
                                        [ 0.,  -1., -1., 1000]])
#array([[ 0.1  ,  0.1  ,  0.2  ,  1.   ],
#       [ 0.02 ,  0.   ,  0.   ,  1.   ],
#       [ 0.   , -0.001, -0.001,  1.   ]])

StandardScaler and MinMaxScaler read the data "vertically":

StandardScaler().fit_transform([[ 1., 1.,  2., 10],
                                [ 2.,  0.,  0., 100],
                                [ 0.,  -1., -1., 1000]])
#array([[ 0.        ,  1.22474487,  1.33630621, -0.80538727],
#       [ 1.22474487,  0.        , -0.26726124, -0.60404045],
#       [-1.22474487, -1.22474487, -1.06904497,  1.40942772]])

MinMaxScaler().fit_transform([[ 1., 1.,  2., 10],
                              [ 2.,  0.,  0., 100],
                              [ 0.,  -1., -1., 1000]])
#array([[0.5       , 1.        , 1.        , 0.        ],
#       [1.        , 0.5       , 0.33333333, 0.09090909],
#       [0.        , 0.        , 0.        , 1.        ]])

Answer 1

This is the expected behavior, because StandardScaler and Normalizer serve different purposes. StandardScaler works 'vertically', because it...

Standardize[s] features by removing the mean and scaling to unit variance

[...] Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
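As a rough illustration of the quoted description, the sketch below reproduces StandardScaler by hand with NumPy on the question's data. The variable names (`col_mean`, `col_std`, `manual`) are made up for this example; the point is only that the statistics are taken over the samples (axis=0), one value per feature column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same data as in the question: 3 samples (rows), 4 features (columns).
X = np.array([[1.,  1.,  2.,   10.],
              [2.,  0.,  0.,  100.],
              [0., -1., -1., 1000.]])

scaler = StandardScaler().fit(X)

# Per-feature statistics, computed down each column (axis=0).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population std (ddof=0), matching StandardScaler

# Manual standardization: subtract the column mean, divide by the column std.
manual = (X - col_mean) / col_std

# The fitted scaler stores the same column statistics and gives the same result.
assert np.allclose(scaler.mean_, col_mean)
assert np.allclose(scaler.scale_, col_std)
assert np.allclose(scaler.transform(X), manual)
```

Because the mean and standard deviation are stored at fit time, the same column statistics are reused when transform is later called on new rows.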

Normalizer, by contrast, works 'horizontally', because it...

Normalize[s] samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.
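A minimal sketch of the same idea for Normalizer, again on the question's data and with illustrative variable names (`row_max`, `row_l2`, and so on) that are not part of scikit-learn. Here each row is divided by its own norm (axis=1), so no information from other samples is involved.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Same data as in the question: 3 samples (rows), 4 features (columns).
X = np.array([[1.,  1.,  2.,   10.],
              [2.,  0.,  0.,  100.],
              [0., -1., -1., 1000.]])

# norm='max': divide every row by the largest absolute value in that row.
row_max = np.abs(X).max(axis=1, keepdims=True)
manual_max = X / row_max
assert np.allclose(Normalizer(norm='max').fit_transform(X), manual_max)

# norm='l2': divide every row by its Euclidean length.
row_l2 = np.linalg.norm(X, axis=1, keepdims=True)
manual_l2 = X / row_l2
assert np.allclose(Normalizer(norm='l2').fit_transform(X), manual_l2)
```

Normalizer is stateless: fit learns nothing from the training data, so the result for a given row depends only on that row.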

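Please see the scikit-learn documentation for StandardScaler and Normalizer (quoted above) for a more in-depth understanding; it will serve your purpose better.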
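On the pipeline concern from the question: both transformers accept the same input layout, an array of shape (n_samples, n_features), and differ only in the axis along which they compute. The sketch below chains them with make_pipeline; the particular combination is chosen only for illustration and is not a recommendation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, Normalizer

# Same data as in the question: rows are samples, columns are features.
X = np.array([[1.,  1.,  2.,   10.],
              [2.,  0.,  0.,  100.],
              [0., -1., -1., 1000.]])

# Step 1 standardizes each column; step 2 rescales each row to unit L2 norm.
pipe = make_pipeline(StandardScaler(), Normalizer(norm='l2'))
out = pipe.fit_transform(X)

# The shape is unchanged, and every row now has Euclidean length 1.
assert out.shape == X.shape
assert np.allclose(np.linalg.norm(out, axis=1), 1.0)
```

No transposing is needed between the two steps, since neither transformer changes the (n_samples, n_features) convention.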

scaling

normalization

python-3.x

scikit-learn