Removing outliers based on column variables or multi-index in a dataframe
This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
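I want to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.). Not the whole row, just the cell. And I would like to do this for each of the 'red', 'yellow', and 'green' columns.
Is there a way to do this without breaking the dataframe into a whole bunch of sub-dataframes, with every condition split out separately? I'm not sure whether this would be easier with 'Season' and 'Treatment' handled as columns or as indices. I'm fine with either way.
I've tried a few things with .iloc and .loc, but I can't seem to make it work.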
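If you need to replace the outliers with missing values, use GroupBy.transform with DataFrame.quantile, then compare for lower and greater values with DataFrame.lt and DataFrame.gt, chain the masks with | for bitwise OR, and set the missing values with DataFrame.mask. The default replacement value is NaN, so it is not specified. Note that the cutoffs used here are the 5th and 95th percentiles of each Season/Treatment group: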
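import numpy as np
import pandas as pd
# Fix the seed so the sample data is reproducible
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
# Per-group 5th and 95th percentiles, broadcast back to the original row index
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
# Value columns only (everything except the grouping columns)
c = df.columns.difference(['Season','Treatment'])
# True where a cell lies below the lower or above the upper cutoff of its group
mask = df[c].lt(df1) | df[c].gt(df2)
# Replace only the flagged cells with NaN; rows are kept
df[c] = df[c].mask(mask)
print(df)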
    Season Treatment   red  yellow  green
0   Spring   Placebo   NaN     NaN   67.0
1   Spring   Placebo  67.0    91.0    3.0
2   Spring   Placebo  71.0    56.0   29.0
3   Spring   Placebo  48.0    32.0   24.0
4   Spring   Placebo  74.0     9.0   51.0
..     ...       ...   ...     ...    ...
95    Fall      Drug  90.0    35.0   55.0
96    Fall      Drug  40.0    55.0   90.0
97    Fall      Drug   NaN    54.0    NaN
98    Fall      Drug  28.0    50.0   74.0
99    Fall      Drug   NaN    73.0   11.0
[100 rows x 5 columns]
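Since the question mentions IQR, the same pattern can probably be adapted to the usual IQR fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR) instead of fixed percentiles. The following is only a sketch under that assumption: the helper name `mask_iqr_outliers` and its `k` parameter are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

def mask_iqr_outliers(df, group_cols, value_cols, k=1.5):
    """Return a copy of df with per-group IQR outliers in value_cols set to NaN.

    Hypothetical helper: a cell is flagged when it lies below Q1 - k*IQR
    or above Q3 + k*IQR, with Q1/Q3 computed within its own group.
    """
    g = df.groupby(group_cols)[value_cols]
    # Per-group quartiles, broadcast back to the shape of df[value_cols]
    q1 = g.transform("quantile", q=0.25)
    q3 = g.transform("quantile", q=0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Cell-wise mask: True where the value falls outside its group's fences
    outliers = df[value_cols].lt(lower) | df[value_cols].gt(upper)
    out = df.copy()
    out[value_cols] = df[value_cols].mask(outliers)
    return out

# Tiny illustrative frame: in group "A" the fences are 8 and 16,
# so the value 100 is the only cell that gets replaced with NaN.
demo = pd.DataFrame({
    "grp": ["A"] * 5 + ["B"] * 5,
    "x": [10, 11, 12, 13, 100, 1, 2, 3, 4, 5],
})
print(mask_iqr_outliers(demo, ["grp"], ["x"]))
```

For the dataframe in the question this would be called as `mask_iqr_outliers(df, ['Season', 'Treatment'], ['red', 'yellow', 'green'])`. With uniformly distributed random integers the IQR fences are wide, so few or no cells may be flagged.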