How to remove outliers

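I managed to apply the interquartile range (IQR) rule correctly, but when I draw the box-and-whisker plot of the dataset with the outliers removed, it still shows outliers. What is going wrong? Here is the code: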

# Load libraries
import pandas as pd
from pandas import read_csv
from matplotlib import pyplot as plt

# Load dataset
filename = "/home/fogang/dataset/Regression/Housing Boston/housing.csv"
df = read_csv(filename, header=0)
df = df.drop('Unnamed: 0', axis=1)  # Drop the 'Unnamed: 0' column
one_dim = df[['rm']].copy()  # Keep only the 'rm' column, as a one-column DataFrame

# Shape of the dataset
print(one_dim.shape)

# Peek at the dataset
print(one_dim.head(10))

# Check whether there are NaN values
print(one_dim.isnull().sum())

# Box-and-whisker plot
one_dim.plot(kind='box', subplots=True, layout=(1, 1), sharex=False, sharey=False, fontsize=12)
plt.show()

# Describe the dataset
print(one_dim.describe())

# Compute the interquartile range and the lower/upper fences
unidim = one_dim['rm']
unidim_Q1 = unidim.quantile(0.25)
unidim_Q3 = unidim.quantile(0.75)
unidim_IQR = unidim_Q3 - unidim_Q1
unidim_lower = unidim_Q1 - (1.5 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.5 * unidim_IQR)

# Outliers: values outside the fences
unidim_outliers = unidim[(unidim < unidim_lower) | (unidim > unidim_upper)].to_frame('outliers')
unidim_outliers.info()

# Good data: values inside the fences (inclusive)
unidim_good = unidim[(unidim >= unidim_lower) & (unidim <= unidim_upper)].to_frame('good')
unidim_good.info()

# Only one column is plotted, so a 1x1 layout is enough
unidim_good.plot(kind='box', subplots=True, layout=(1, 1), sharex=False, sharey=False, fontsize=12)
plt.show()

What should I do?

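The values at both the lower and upper ends are spread too widely. So when you cut off some outliers and check again, new outliers appear in the trimmed data: the box plot recomputes Q1, Q3, and the IQR from the trimmed data, the fences usually tighten, and points that sat just inside the old fences now fall outside the new ones. If you want to get rid of the outliers in a single cut, you can use a stricter cut-off rule (the multiplier that works depends on your data), for example: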

unidim_lower = unidim_Q1 - (1.3 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.3 * unidim_IQR)
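
If guessing a stricter multiplier feels arbitrary, another option is to keep the 1.5 rule and repeat the cut until a pass removes nothing. The following is only a rough sketch of that idea, not something taken from the question: the helper names (iqr_fences, trim_once, trim_until_stable), the max_rounds guard, and the sample values are all made up for illustration, with sample standing in for unidim.

```python
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim_once(values, k=1.5):
    """Keep only the values inside the fences computed from `values` itself."""
    lower, upper = iqr_fences(values, k)
    return values[values.between(lower, upper)]


def trim_until_stable(values, k=1.5, max_rounds=100):
    """Repeat trim_once until a pass removes nothing (or max_rounds is hit)."""
    trimmed = values.dropna()
    for _ in range(max_rounds):
        kept = trim_once(trimmed, k)
        if len(kept) == len(trimmed):
            break  # no value was removed, so the data has no outliers left
        trimmed = kept
    return trimmed


# Made-up values standing in for unidim (the 'rm' column).
sample = pd.Series([2, 7, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 15, 17, 30])

first_pass = trim_once(sample)      # a single cut, as in the question
stable = trim_until_stable(sample)  # cut repeatedly until nothing changes

print(len(sample), len(first_pass), len(stable))  # prints: 15 13 11
```

On this sample the first cut removes 2 and 30, but 7 and 17 then fall outside the recomputed fences, which is the same effect you are seeing in your second box plot. Keep in mind that repeated trimming can throw away a lot of data on heavy-tailed columns.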

But a word of warning: not all 'outliers' are bad for the model, so choose carefully what you treat as 'outliers' and what is useful data.
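
One way to act on that warning is to flag the suspicious rows first and look at them before deleting anything. A minimal sketch, assuming made-up values for 'rm' and a hypothetical is_outlier column that is not part of the original dataset:

```python
import pandas as pd

# Made-up values standing in for the real 'rm' column.
frame = pd.DataFrame({"rm": [3.6, 5.9, 6.0, 6.2, 6.3, 6.4, 6.6, 6.8, 8.8]})

q1 = frame["rm"].quantile(0.25)
q3 = frame["rm"].quantile(0.75)
iqr = q3 - q1

# Flag instead of dropping, so every row stays available for inspection.
frame["is_outlier"] = ~frame["rm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Look at the flagged rows before deciding whether they are errors or real data.
print(frame[frame["is_outlier"]])
```

Once you have decided, frame[~frame["is_outlier"]] gives the rows you chose to keep.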