How to not remove but handle outliers by transforming using pandas?

I have a dataframe that looks like this:

dfx = pd.DataFrame({'min_temp' :[-138,36,34,38,237,339]})

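As you can see, this data contains three outliers: -138, 237 and 339.

What I would like to do is identify the records that are

a) more than 3 standard deviations above the mean, and replace them with the maximum valid value (considering the data range).

b) more than 3 standard deviations below the mean, and replace them with the minimum valid value (considering the data range).

This is what I tried, but it is incorrect and inefficient: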

dfx.apply(lambda x: x[(x < dfx[min_temp].mean()-3*dfx[min_temp].std(), dfx[min_temp].mean()+3*dfx[min_temp].std())])

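In the example above, 38 is the maximum value because it is within the 3 sd limit and is the valid maximum (meaning it is not an outlier). Similarly, 34 is the minimum value because it is within the -3 sd range.

These are the values that should replace all outliers in the full dataframe.

Please note that my real data has more than 60 columns and 1 million rows, and I would like to do this for all columns. Any efficient and scalable approach is helpful.

In my expected output, outliers are replaced with the maximum valid value within 3 sd (38 in this case).

Can you help me with this?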

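Update based on the suggested solution

Here is a generalized function that uses the following logic to detect outliers.

The function takes a dataframe as its argument, so make sure it contains only numeric columns.

A data point X is treated as valid (not an outlier) when: abs(X - mean) <= (std * 3)

Or in other words:

abs(residual) <= 3*std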

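import pandas as pd

def replace_outliers(df, n_std):
    # True where a value lies within n_std standard deviations of its column mean
    within = df.sub(df.mean()).abs().le(df.std().mul(n_std))
    valid = df.where(within)  # outliers become NaN

    # Cap high outliers to the largest valid value and low outliers to the smallest
    return df.clip(lower=valid.min(), upper=valid.max(), axis=1)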

Test:

dfx = pd.DataFrame({'min_temp' : [10,12,12,13,12,11,14,13,15,10,10,10,100,12,14,13]})

# replaces 100 with 15
replace_outliers(dfx, 3)
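Since the function expects numeric columns only, a wide frame with mixed dtypes needs some filtering first. The following is a minimal sketch under that assumption, not part of the original answer: the name `replace_outliers_numeric` and the sample frame `mixed` are made up for illustration. It applies the same 3-std rule to the numeric columns, caps low outliers to the smallest valid value and high outliers to the largest, and leaves other columns untouched.

```python
import pandas as pd

def replace_outliers_numeric(df, n_std=3):
    """Cap values beyond n_std standard deviations, numeric columns only."""
    num = df.select_dtypes(include="number")
    # True where a value lies within n_std standard deviations of its column mean
    within = num.sub(num.mean()).abs().le(num.std().mul(n_std))
    valid = num.where(within)  # outliers become NaN
    # Pull low outliers up to the smallest valid value, high ones down to the largest
    capped = num.clip(lower=valid.min(), upper=valid.max(), axis=1)
    out = df.copy()
    out[capped.columns] = capped
    return out

# Hypothetical sample: one high outlier, one low outlier, one text column
mixed = pd.DataFrame({
    "temp": [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13],
    "low": [50, 52, 51, 49, 50, 53, 48, 50, 51, 52, 49, 50, -500, 51, 50, 52],
    "city": ["a"] * 16,
})
result = replace_outliers_numeric(mixed)
print(result[["temp", "low"]].agg(["min", "max"]))
```

All steps are column-wise vectorized operations, so no Python-level loop over rows is involved.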

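This answer is based on information from a good article on outlier detection, where you can read about each method.
The output of each code block shows the resulting lower and upper bounds used for outlier detection.

First, let's define some sample data: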

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.normal(loc=20, scale=2, size=10)})

# Insert outliers (use .loc to avoid chained assignment)
df.loc[0, 'col1'] = 40
df.loc[1, 'col1'] = 0

df['col1']

Output:

0    40.000000
1     0.000000
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64

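Z-score method

This method is the least reliable of the three. It does not work for small datasets, because the mean and standard deviation are heavily affected by the outliers themselves.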

def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Z-score).'''
    mean_val = series.mean()
    std_val = series.std()

    z_score = (series - mean_val) / std_val
    outliers = abs(z_score) > zscore_threshold

    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min()

    # For comparison purposes.
    if verbose:
        lbound = mean_val - zscore_threshold * std_val
        ubound = mean_val + zscore_threshold * std_val
        print('\n'.join(
            ['Capping outliers by the Z-score method:',
             f'   Z-score threshold: {zscore_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)

Output:

Capping outliers by the Z-score method:
   Z-score threshold: 3
   Lower bound: -8.28385086324063
   Upper bound: 49.22620154113844

0    40.000000
1     0.000000
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64
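One reason 40 and 0 pass through unchanged in the output above: with the sample standard deviation (ddof=1, the pandas default), no point in a sample of size n can have |z| larger than (n - 1) / sqrt(n). For n = 10 that is about 2.85, so a threshold of 3 can never trigger. A minimal sketch that checks this bound; the helper name `max_possible_zscore` and the series `s` are made up for illustration:

```python
import numpy as np
import pandas as pd

def max_possible_zscore(n):
    """Upper bound on |z| in a sample of size n when std uses ddof=1."""
    return (n - 1) / np.sqrt(n)

# Nine identical values plus one extreme value: the worst case for a sample of 10
s = pd.Series([20.0] * 9 + [1e6])
z = (s - s.mean()) / s.std()

print(z.abs().max())                # largest |z| actually observed
print(max_possible_zscore(len(s)))  # theoretical ceiling for n = 10
```

However extreme the single outlier is made, its Z-score stays below 3 at this sample size.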

Modified Z-score method

This method is more reliable than the previous one. It uses the median and mad instead of the mean and std.

def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Modified Z-score).'''
    median_val = series.median()
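    # Series.mad() was removed in pandas 2.0. It computed the MEAN absolute
    # deviation, reproduced here so the bounds printed below still match.
    # (A true median absolute deviation is (series - median_val).abs().median().)
    mad_val = (series - series.mean()).abs().mean()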

    z_score = (series - median_val) / mad_val
    outliers = abs(z_score) > zscore_threshold

    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min() 

    # For comparison purposes.
    if verbose:
        lbound = median_val - zscore_threshold * mad_val
        ubound = median_val + zscore_threshold * mad_val
        print('\n'.join(
            ['Capping outliers by the Modified Z-score method:',
             f'   Z-score threshold: {zscore_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)

Output:

Capping outliers by the Modified Z-score method:
   Z-score threshold: 3
   Lower bound: 5.538418022763285
   Upper bound: 36.19368140628174

0    22.637459
1    16.648512
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64
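The bounds printed above come from pandas' old `Series.mad()`, which is the mean absolute deviation. A common textbook variant of the modified Z-score uses the median absolute deviation instead, scaled by 0.6745 and compared against a threshold of 3.5. The sketch below shows that variant as an alternative, not as the answer's method; the function name `cap_outliers_median_mad` and the hard-coded sample values (copied from the sample output earlier) are assumptions for illustration.

```python
import pandas as pd

def cap_outliers_median_mad(series, threshold=3.5):
    """Cap outliers using the modified Z-score with a median-based MAD."""
    median_val = series.median()
    mad_val = (series - median_val).abs().median()  # median absolute deviation
    if mad_val == 0:
        # More than half the values equal the median; the score is undefined
        return series.copy()
    mod_z = 0.6745 * (series - median_val) / mad_val
    inliers = series[mod_z.abs() <= threshold]
    # Cap to the closest existing values that are not flagged
    return series.clip(lower=inliers.min(), upper=inliers.max())

sample = pd.Series([40.0, 0.0, 19.218962, 16.648512, 21.444715,
                    22.637459, 21.016641, 22.527376, 20.502631, 20.715458])
capped = cap_outliers_median_mad(sample)
print(capped)
```

On this sample it caps 40 and 0 to the same values as the function above, although the implied bounds are tighter.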

IQR method

This method is the strictest of the three: on the sample data it yields the narrowest bounds.

def cap_outliers(series, iqr_threshold=1.5, verbose=False):
    '''Caps outliers to closest existing value within threshold (IQR).'''
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    lbound = Q1 - iqr_threshold * IQR
    ubound = Q3 + iqr_threshold * IQR

    outliers = (series < lbound) | (series > ubound)

    series = series.copy()
    series.loc[series < lbound] = series.loc[~outliers].min()
    series.loc[series > ubound] = series.loc[~outliers].max()

    # For comparison purposes.
    if verbose:
        print('\n'.join(
            ['Capping outliers by the IQR method:',
             f'   IQR threshold: {iqr_threshold}',
             f'   Lower bound: {lbound}',
             f'   Upper bound: {ubound}\n']))

    return series

cap_outliers(df['col1'], verbose=True)

Output:

Capping outliers by the IQR method:
   IQR threshold: 1.5
   Lower bound: 15.464630871041477
   Upper bound: 26.331958943979345

0    22.637459
1    16.648512
2    19.218962
3    16.648512
4    21.444715
5    22.637459
6    21.016641
7    22.527376
8    20.502631
9    20.715458
Name: col1, dtype: float64
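The question asks for all columns of a large frame, while the function above works on one Series at a time. One way to avoid a per-column Python loop is to compute the quartiles for every column at once and cap with `DataFrame.clip`. This is a sketch under the assumption that all columns are numeric; the name `cap_outliers_iqr_frame` and the two-column sample are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr_frame(df, iqr_threshold=1.5):
    """Cap outliers in every column to the closest in-bounds value (IQR rule)."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lbound = q1 - iqr_threshold * iqr
    ubound = q3 + iqr_threshold * iqr
    # Keep only in-bounds values; out-of-bounds values become NaN
    inliers = df.where(df.ge(lbound, axis=1) & df.le(ubound, axis=1))
    # Per-column caps are the extreme values among the inliers
    return df.clip(lower=inliers.min(), upper=inliers.max(), axis=1)

wide = pd.DataFrame({
    "a": [40.0, 0.0, 19.218962, 16.648512, 21.444715,
          22.637459, 21.016641, 22.527376, 20.502631, 20.715458],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0],
})
capped_wide = cap_outliers_iqr_frame(wide)
print(capped_wide)
```

Column `a` reuses the sample values shown earlier, so its result matches the per-Series output above.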

Conclusion

You should probably use the IQR method.