Data Standardization vs Normalization vs Robust Scaler

I am working on data preprocessing and want to compare the practical benefits of Data Standardization vs Normalization vs Robust Scaler.

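In theory, the guidelines are:

Advantages:

Standardization: scales features so that the distribution is centered around 0, with a standard deviation of 1.
Normalization: shrinks the range so that it now lies between 0 and 1 (or -1 to 1 if there are negative values).
Robust Scaler: similar to normalization, but it uses the interquartile range instead, so it is robust to outliers.

Disadvantages:

Standardization: not good if the data is not normally distributed (i.e. no Gaussian distribution).
Normalization: heavily influenced by outliers (i.e. extreme values).
Robust Scaler: doesn't take the median into account and only focuses on the part where the bulk of the data is.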
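
For reference, the three transforms listed above can be written out in a few lines of plain Python. This is a dependency-free sketch, not scikit-learn's implementation: the function names `standardize`, `min_max` and `robust_scale` are made up for illustration, and the sample values are arbitrary. It assumes the defaults used by `StandardScaler`, `MinMaxScaler` and `RobustScaler` (population standard deviation, a [0, 1] target range, and the 25th to 75th percentile range).

```python
import statistics

def standardize(xs):
    """z = (x - mean) / std, using the population standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    """x' = (x - min) / (max - min), which maps the data onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    """x' = (x - median) / IQR, where IQR = Q3 - Q1."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]

values = [-4.0, -1.0, 0.0, 2.0, 3.0, 5.0, 9.0]  # arbitrary sample, includes negatives

z = standardize(values)   # mean 0, standard deviation 1
m = min_max(values)       # smallest value -> 0, largest -> 1
r = robust_scale(values)  # median -> 0, one IQR -> one unit
```

Note that with this min-max formula the output lies in [0, 1] even when the input contains negative values; a [-1, 1] range has to be requested explicitly, for example with `MinMaxScaler(feature_range=(-1, 1))`.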

I created 20 random numerical inputs and tried the methods above (numbers in red represent the outliers):

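I noticed that normalization was indeed negatively affected by the outliers: the differences between the new values became tiny (all values almost identical up to 6 decimal places, 0.000000x), even though there are noticeable differences between the original inputs!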
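
This effect is easy to reproduce. The original 20 inputs are not available here, so the sketch below uses made-up values (five ordinary numbers plus one hypothetical extreme outlier) and applies the min-max formula by hand rather than through scikit-learn.

```python
# Made-up data: five ordinary values and one extreme outlier (hypothetical).
values = [10.0, 12.0, 15.0, 18.0, 20.0, 5_000_000.0]

lo, hi = min(values), max(values)
scaled = [(x - lo) / (hi - lo) for x in values]  # min-max scaling to [0, 1]

for x, s in zip(values, scaled):
    print(f"{x:>12.1f} -> {s:.7f}")

# Spread of the ordinary values after scaling: the outlier alone defines the
# top of the range, so everything else is squeezed next to 0.
inlier_spread = max(scaled[:-1]) - min(scaled[:-1])
print(f"spread of the ordinary values: {inlier_spread:.2e}")
```

The outlier maps to 1 and the five ordinary values all land within a few millionths of 0, even though they differ by a factor of two in the original data.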

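My questions are:

Am I right to say that standardization is also negatively affected by extreme values? If not, why, given the results provided?
I really can't see how the Robust Scaler improved the data, since I still have extreme values in the resulting data set. Is there a simple but complete interpretation?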

Am I right to say that also Standardization gets affected negatively by the extreme values as well?

Indeed you are; the scikit-learn docs themselves explicitly warn about this case:

However, when data contains outliers, StandardScaler can often be mislead. In such cases, it is better to use a scaler that is robust against outliers.

More or less the same holds for MinMaxScaler as well.
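
Standardization is sensitive because both the mean and the standard deviation it relies on are themselves pulled by extreme values. A small sketch with made-up numbers (not the data from the question), using only the standard library; `zscores` is a hypothetical helper:

```python
import statistics

def zscores(xs):
    """Standardize with the population standard deviation."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

clean = [10.0, 12.0, 15.0, 18.0, 20.0]   # hypothetical ordinary values
with_outlier = clean + [1000.0]          # same values plus one extreme point

z_clean = zscores(clean)
z_dirty = zscores(with_outlier)

# Range covered by the five ordinary values before and after adding the outlier.
spread_clean = max(z_clean) - min(z_clean)
spread_dirty = max(z_dirty[:5]) - min(z_dirty[:5])

print(f"spread without outlier: {spread_clean:.3f}")
print(f"spread with outlier:    {spread_dirty:.3f}")
```

With the outlier included, the ordinary values end up bunched together on one side of zero while the outlier sits a couple of standard deviations away, which is the kind of distortion the docs warn about.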

I really can't see how the Robust Scaler improved the data because I still have extreme values in the resulted data set? Any simple -complete interpretation?

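Robust does not mean immune or invulnerable, and the purpose of scaling is not to "remove" outliers and extreme values; that is a separate task with its own methods. This is again explicitly mentioned in the relevant scikit-learn docs: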

RobustScaler

[...] Note that the outliers themselves are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is required (see below).

where "see below" refers to QuantileTransformer and quantile_transform.
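
The difference between a linear robust scaling and a non-linear transform can be illustrated without scikit-learn. The sketch below, on made-up data, applies median/IQR scaling by hand and then a simplified rank-based mapping to [0, 1]. The helper `rank_to_uniform` is hypothetical and only mimics the idea behind a quantile transform; it is not how `QuantileTransformer` is implemented, and it ignores ties.

```python
import statistics

values = [10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 1000.0]  # last value is a made-up outlier

# Median/IQR scaling: the statistics come from the bulk of the data,
# so the outlier barely influences them, but it is still transformed linearly.
q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(x - med) / (q3 - q1) for x in values]

def rank_to_uniform(xs):
    """Map each value to rank / (n - 1); a non-linear, order-preserving transform."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = rank / (len(xs) - 1)
    return out

uniform = rank_to_uniform(values)

print("robust :", [round(v, 2) for v in robust])
print("uniform:", [round(v, 2) for v in uniform])
```

After the median/IQR scaling the ordinary values keep a usable spread, but the outlier is still far away from them. The rank-based mapping confines everything to [0, 1], at the cost of distorting the distances between values.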

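None of them is robust in the sense that the scaling takes care of the outliers and confines them to a restricted range, i.e. so that no extreme values appear.

You can consider options like:

Clipping the series/array before scaling
Taking transformations such as the square root or the logarithm, if clipping is not ideal
Obviously, adding another column 'is clipped'/'logarithmic clipped amount' will reduce the information loss.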
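
A minimal sketch of these options in plain Python. The 5th and 95th percentile bounds, the helper name `clip_with_flag`, and the sample values are all assumptions for illustration; suitable bounds depend on the data.

```python
import math
import statistics

def clip_with_flag(xs, lower_pct=5, upper_pct=95):
    """Clip xs to the given percentiles and report which entries were changed."""
    cuts = statistics.quantiles(xs, n=100, method="inclusive")  # 1st..99th percentiles
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    clipped = [min(max(x, lo), hi) for x in xs]
    is_clipped = [c != x for c, x in zip(clipped, xs)]
    return clipped, is_clipped

# Made-up data: the values 1..20 plus one extreme outlier.
values = [float(v) for v in range(1, 21)] + [500.0]

clipped, is_clipped = clip_with_flag(values)

# Alternative when clipping is not desirable: compress large values instead.
# log1p requires x > -1; these sample values are all positive.
logged = [math.log1p(x) for x in values]

print("largest value after clipping:", max(clipped))
print("number of clipped entries:   ", sum(is_clipped))
```

The `is_clipped` list can be kept as an extra feature column, so that a model can still tell which rows were extreme before clipping. Either result can then be passed on to whichever scaler is being used.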

python

machine-learning

normalization

standardized

scikit-learn