How to not remove but handle outliers by transforming using pandas?
I have a dataframe like the one shown below
import pandas as pd
dfx = pd.DataFrame({'min_temp' :[-138,36,34,38,237,339]})
As you can see, there are three outliers in this data: -138, 237, and 339
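What I would like to do is identify the records that are
a) greater than 3 standard deviations and replace them with the maximum valid value (considering the range of the data).
b) less than -3 standard deviations and replace them with the minimum valid value (considering the range of the data).
This is what I tried, but it is incorrect and not efficient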
dfx.apply(lambda x: x[(x < dfx[min_temp].mean()-3*dfx[min_temp].std(), dfx[min_temp].mean()+3*dfx[min_temp].std())])
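In the example above, 38 is the max value because it is within the 3sd limit and is a valid max value (meaning it is not an outlier). Similarly, 34 is the min value because it is within the -3sd range.
We need to use these to replace all outliers in the complete dataframe.
Please note that my real data has more than 60 columns and 1 million rows. I would like to do this for all columns. Any efficient and scalable approach is helpful.
I expect my output to have each outlier replaced with the maximum valid value within 3sd (38 in this case).
Can you help me with this?
Update based on the suggested solution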
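Here is a generalized function that uses the following logic to detect non-outliers.
The function takes a dataframe as its argument, so make sure you only have numeric columns.
for each data point X: abs(X - mean) <= (std * 3)
Or in other words:
residual <= 3*std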
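def replace_outliers(df, n_std):
    # True where a value lies within n_std standard deviations of its column mean
    inliers = df.sub(df.mean()).abs().le(df.std().mul(n_std))
    # Mask the outliers as NaN, keeping only the valid values
    inliers_only = df.where(inliers)
    # Fill every masked outlier with the largest valid value of its column
    outliers_replaced = inliers_only.fillna(inliers_only.max())
    return outliers_replaced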
Test:
dfx = pd.DataFrame({'min_temp' : [10,12,12,13,12,11,14,13,15,10,10,10,100,12,14,13]})
# replaces 100 with 15
replace_outliers(dfx, 3)
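The function above fills every outlier with the column maximum, so a low outlier would also end up as the largest valid value. A minimal sketch of a two-sided variant follows, assuming the same mean/std rule. The name `cap_outliers_frame` and the sample columns `hi` and `lo` are hypothetical and not part of the original answer. It relies on `DataFrame.clip` with per-column bounds, so it stays vectorized across all columns.

```python
import pandas as pd


def cap_outliers_frame(df, n_std=3):
    """Cap values outside mean +/- n_std*std to the nearest valid value, per column."""
    # True where a value lies within n_std standard deviations of its column mean
    inlier_mask = df.sub(df.mean()).abs().le(df.std().mul(n_std))
    # Outliers become NaN, so min()/max() only see valid values
    inliers = df.where(inlier_mask)
    # Low outliers are raised to the smallest inlier, high ones lowered to the largest
    return df.clip(lower=inliers.min(), upper=inliers.max(), axis=1)


sample = pd.DataFrame({
    'hi': [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13],
    'lo': [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, -100, 12, 14, 13],
})
capped = cap_outliers_frame(sample)
print(capped.iloc[12])
```

Only numeric columns are assumed, as in the function above.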
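This answer is based on information from this great article about outlier detection. You can read about each method there.
The output of each code snippet shows the resulting lower and upper bounds for outlier detection.
First, let's define some sample data: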
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.normal(loc=20, scale=2, size=10)})
# Insert outliers (use .loc to avoid chained assignment)
df.loc[0, 'col1'] = 40
df.loc[1, 'col1'] = 0
df['col1']
Output:
0 40.000000
1 0.000000
2 19.218962
3 16.648512
4 21.444715
5 22.637459
6 21.016641
7 22.527376
8 20.502631
9 20.715458
Name: col1, dtype: float64
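Z-score method
This method is the least reliable of the three. It does not work well on small datasets (the mean and the standard deviation are heavily affected by the outliers).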
def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Z-score).'''
    mean_val = series.mean()
    std_val = series.std()
    z_score = (series - mean_val) / std_val
    outliers = abs(z_score) > zscore_threshold
    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min()
    # For comparison purposes.
    if verbose:
        lbound = mean_val - zscore_threshold * std_val
        ubound = mean_val + zscore_threshold * std_val
        print('\n'.join(
            ['Capping outliers by the Z-score method:',
             f' Z-score threshold: {zscore_threshold}',
             f' Lower bound: {lbound}',
             f' Upper bound: {ubound}\n']))
    return series
cap_outliers(df['col1'], verbose=True)
Output:
Capping outliers by the Z-score method:
Z-score threshold: 3
Lower bound: -8.28385086324063
Upper bound: 49.22620154113844
0 40.000000
1 0.000000
2 19.218962
3 16.648512
4 21.444715
5 22.637459
6 21.016641
7 22.527376
8 20.502631
9 20.715458
Name: col1, dtype: float64
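In the output above neither 40 nor 0 is capped. One way to see why small samples are a problem: with the sample standard deviation, the largest possible absolute Z-score in a sample of size n is (n - 1) / sqrt(n) (Samuelson's inequality), which stays below 3 for n <= 10. The sketch below checks this on the six-value `min_temp` data from the question; the helper names `max_possible_zscore` and `max_abs_zscore` are hypothetical.

```python
import numpy as np
import pandas as pd


def max_possible_zscore(n):
    """Upper bound on |z| for a sample of size n (sample std, ddof=1)."""
    return (n - 1) / np.sqrt(n)


def max_abs_zscore(series):
    """Largest absolute Z-score actually observed in the series."""
    z = (series - series.mean()) / series.std()
    return z.abs().max()


dfx = pd.DataFrame({'min_temp': [-138, 36, 34, 38, 237, 339]})
observed = max_abs_zscore(dfx['min_temp'])
bound = max_possible_zscore(len(dfx))
bound10 = max_possible_zscore(10)
print(f'observed max |z|: {observed:.3f}, bound for n=6: {bound:.3f}')
print(f'bound for n=10: {bound10:.3f}')
```

Under this bound, a threshold of 3 can never flag anything in samples of 10 rows or fewer, regardless of how extreme a value is.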
Modified Z-score method
This method is more reliable than the previous one. It uses the median and mad instead of the mean and std.
def cap_outliers(series, zscore_threshold=3, verbose=False):
    '''Caps outliers to closest existing value within threshold (Modified Z-score).'''
    median_val = series.median()
    # Mean absolute deviation around the mean. This is what Series.mad()
    # computed; that method was removed in pandas 2.0.
    mad_val = (series - series.mean()).abs().mean()
    z_score = (series - median_val) / mad_val
    outliers = abs(z_score) > zscore_threshold
    series = series.copy()
    series.loc[z_score > zscore_threshold] = series.loc[~outliers].max()
    series.loc[z_score < -zscore_threshold] = series.loc[~outliers].min()
    # For comparison purposes.
    if verbose:
        lbound = median_val - zscore_threshold * mad_val
        ubound = median_val + zscore_threshold * mad_val
        print('\n'.join(
            ['Capping outliers by the Modified Z-score method:',
             f' Z-score threshold: {zscore_threshold}',
             f' Lower bound: {lbound}',
             f' Upper bound: {ubound}\n']))
    return series
cap_outliers(df['col1'], verbose=True)
Output:
Capping outliers by the Modified Z-score method:
Z-score threshold: 3
Lower bound: 5.538418022763285
Upper bound: 36.19368140628174
0 22.637459
1 16.648512
2 19.218962
3 16.648512
4 21.444715
5 22.637459
6 21.016641
7 22.527376
8 20.502631
9 20.715458
Name: col1, dtype: float64
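The bounds printed above are consistent with a mean absolute deviation of about 5.11. A commonly cited form of the modified Z-score instead uses the median absolute deviation, scaled by 0.6745, with a threshold of 3.5. The following is a minimal sketch under those conventions, which are assumptions here and not taken from the answer; the name `modified_zscore` is hypothetical, and the values are copied from the sample output shown earlier.

```python
import pandas as pd


def modified_zscore(series):
    """Modified Z-score based on the median absolute deviation (MAD)."""
    median_val = series.median()
    # Median of absolute deviations from the median
    mad_val = (series - median_val).abs().median()
    # 0.6745 rescales the MAD so scores are comparable to ordinary Z-scores
    # for normally distributed data. Note: mad_val == 0 would divide by zero.
    return 0.6745 * (series - median_val) / mad_val


s = pd.Series([40.0, 0.0, 19.218962, 16.648512, 21.444715,
               22.637459, 21.016641, 22.527376, 20.502631, 20.715458])
scores = modified_zscore(s)
flagged = s[scores.abs() > 3.5]
print(flagged)
```

On this sample only the two inserted values (index 0 and 1) exceed the threshold.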
IQR method
This method is the strictest of the three.
def cap_outliers(series, iqr_threshold=1.5, verbose=False):
    '''Caps outliers to closest existing value within threshold (IQR).'''
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lbound = Q1 - iqr_threshold * IQR
    ubound = Q3 + iqr_threshold * IQR
    outliers = (series < lbound) | (series > ubound)
    series = series.copy()
    series.loc[series < lbound] = series.loc[~outliers].min()
    series.loc[series > ubound] = series.loc[~outliers].max()
    # For comparison purposes.
    if verbose:
        print('\n'.join(
            ['Capping outliers by the IQR method:',
             f' IQR threshold: {iqr_threshold}',
             f' Lower bound: {lbound}',
             f' Upper bound: {ubound}\n']))
    return series
cap_outliers(df['col1'], verbose=True)
Output:
Capping outliers by the IQR method:
IQR threshold: 1.5
Lower bound: 15.464630871041477
Upper bound: 26.331958943979345
0 22.637459
1 16.648512
2 19.218962
3 16.648512
4 21.444715
5 22.637459
6 21.016641
7 22.527376
8 20.502631
9 20.715458
Name: col1, dtype: float64
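The question mentions more than 60 columns and 1 million rows. One way to apply the IQR rule to every numeric column at once, without calling a function per column, is sketched below. The name `cap_outliers_iqr_frame` and the sample columns `a` and `b` are hypothetical; the logic mirrors the per-series function above.

```python
import pandas as pd


def cap_outliers_iqr_frame(df, iqr_threshold=1.5):
    """Cap IQR outliers in every column to the nearest valid value."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lbound = q1 - iqr_threshold * iqr
    ubound = q3 + iqr_threshold * iqr
    # Keep only values inside [lbound, ubound]; everything else becomes NaN
    inliers = df.where(df.ge(lbound) & df.le(ubound))
    # Clip each column to the range spanned by its own inliers
    return df.clip(lower=inliers.min(), upper=inliers.max(), axis=1)


sample = pd.DataFrame({
    'a': [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13],
    'b': [5, 6, 5, 7, 6, 5, 6, 7, -50, 6, 5, 6, 7, 6, 5, 6],
})
capped = cap_outliers_iqr_frame(sample)
print(capped.agg(['min', 'max']))
```

All operations here work on whole columns, so the cost is a few passes over the data plus two quantile computations per column.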
Conclusion
You should probably use the IQR method.