Removing outliers based on column variables or multi-index in a dataframe

Question

This is another IQR outlier question. I have a dataframe that looks something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df

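I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.). Not the whole row, just the cell. And I would like to do this for each of the 'red', 'yellow' and 'green' columns.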

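Is there a way to do this without breaking the dataframe into a whole bunch of sub-dataframes, with each condition split out separately? I'm not sure whether this would be easier with 'Season' and 'Treatment' handled as columns or as indices. I'm fine with either way.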

I've tried a few things with .iloc and .loc, but I can't seem to make it work.

Answer 1

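To replace the outliers with missing values, use GroupBy.transform with DataFrame.quantile to get the lower and upper cutoffs per group, compare the values against them with DataFrame.lt and DataFrame.gt, chain the two masks with | (bitwise OR), and set the flagged cells to missing values with DataFrame.mask. The default replacement is NaN, so it is not specified. Note that the cutoffs used here are the 5th and 95th percentiles of each Season/Treatment group, not the 1.5 × IQR fences: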

import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]

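# Per-group 5th and 95th percentiles, broadcast back to the original rows
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)

# Value columns only (everything except the grouping keys)
c = df.columns.difference(['Season','Treatment'])
# True where a cell lies below the lower or above the upper cutoff of its group
mask = df[c].lt(df1) | df[c].gt(df2)
# Replace only the flagged cells with NaN; the rest of each row is kept
df[c] = df[c].mask(mask)

print(df)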
    Season Treatment   red  yellow  green
0   Spring   Placebo   NaN     NaN   67.0
1   Spring   Placebo  67.0    91.0    3.0
2   Spring   Placebo  71.0    56.0   29.0
3   Spring   Placebo  48.0    32.0   24.0
4   Spring   Placebo  74.0     9.0   51.0
..     ...       ...   ...     ...    ...
95    Fall      Drug  90.0    35.0   55.0
96    Fall      Drug  40.0    55.0   90.0
97    Fall      Drug   NaN    54.0    NaN
98    Fall      Drug  28.0    50.0   74.0
99    Fall      Drug   NaN    73.0   11.0

[100 rows x 5 columns]
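
Since the question asks about IQR outliers, the same pattern can be adapted to the common 1.5 × IQR rule. The following is only a sketch under some assumptions: the helper name mask_iqr_outliers, the parameter k and the small demo frame are made up for illustration and are not part of the original answer, and every non-grouping column is assumed to be numeric.

```python
import pandas as pd


def mask_iqr_outliers(df, by, k=1.5):
    """Hypothetical helper: return a copy of df in which cells outside
    [Q1 - k*IQR, Q3 + k*IQR] of their own group are replaced by NaN.

    `by` is a list of column names or index level names to group on.
    Assumes every non-grouping column is numeric.
    """
    out = df.copy()
    g = out.groupby(by)
    # Per-group quartiles, broadcast back to the original rows
    q1 = g.transform('quantile', q=0.25)
    q3 = g.transform('quantile', q=0.75)
    iqr = q3 - q1
    cols = q1.columns  # value columns (grouping keys are excluded)
    # True where a cell lies outside the fences of its own group
    mask = out[cols].lt(q1 - k * iqr) | out[cols].gt(q3 + k * iqr)
    out[cols] = out[cols].mask(mask)
    return out


# Illustrative data: 100 is extreme within Spring but ordinary within Fall
demo = pd.DataFrame({
    'Season': ['Spring'] * 6 + ['Fall'] * 6,
    'Treatment': ['Placebo'] * 12,
    'red': [10, 11, 12, 13, 14, 100, 100, 101, 102, 103, 104, 10],
})

# Grouping keys as columns
cleaned = mask_iqr_outliers(demo, ['Season', 'Treatment'])

# Grouping keys as index levels: groupby also accepts index level names
indexed = demo.set_index(['Season', 'Treatment'])
cleaned_idx = mask_iqr_outliers(indexed, ['Season', 'Treatment'])
```

In this demo the value 100 is masked in the Spring group but kept in the Fall group, because the fences are computed per group. The helper is called the same way whether 'Season' and 'Treatment' are columns or index levels, so neither layout should be noticeably easier than the other.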

outliers

multi-index

dataframe

python-3.x

pandas