Outlier detection based on the moving mean in Python

I am trying to port an algorithm from MATLAB to Python. The algorithm works on large datasets and needs to apply outlier detection and removal.

In the MATLAB code, the outlier removal technique I use is movmedian:

   Outlier_T=isoutlier(Data_raw.Temperatura,'movmedian',3);
   Data_raw(find(Outlier_T),:)=[]

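It detects outliers with a rolling median by finding disproportionate values at the center of a three-value moving window. So if I have a "Temperatura" column with a 40 in row 3, that value is detected and the whole row is deleted.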

         Temperatura     Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        40           7.0        
    4        25.31        9.3        
    5        26.21       15.6
    6        26.59       17.9        
   ...        ...         ...

As far as I understand, this is achieved with pandas.DataFrame.rolling. I have seen several posts illustrating its use, but I cannot get it to work with my code:

Attempt A:

Dataframe.rolling(df["t_new"]))

Attempt B:

df-df.rolling(3).median().abs()>200

# based on @Ami Tavory's answer

Am I missing something obvious? What is the right way to do this? Thanks for your time.
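For reference, the centered three-value window described in the question can be sketched in pandas roughly as follows. Note that `rolling` is called on a column and takes the window size, not the column itself. The variable names (`temps`, `centered_median`, `deviation`) are made up for illustration, and `min_periods=1` is an assumption that lets the window shrink at the two ends instead of producing NaN.

```python
import pandas as pd

# Sample values taken from the "Temperatura" column in the question
temps = pd.Series([24.72, 25.76, 40, 25.31, 26.21, 26.59], name="Temperatura")

# center=True aligns each median with the middle of its window;
# min_periods=1 (an assumption) lets the window shrink at the edges
centered_median = temps.rolling(window=3, center=True, min_periods=1).median()

# Absolute distance of each value from its local median
deviation = (temps - centered_median).abs()

print(pd.DataFrame({
    "Temperatura": temps,
    "median": centered_median,
    "deviation": deviation,
}))
```

The row holding 40 stands out with by far the largest deviation. How large a deviation counts as an outlier is a separate decision.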

The code below removes rows based on a threshold, which can be adjusted as needed. I am not sure whether it replicates the MATLAB code, though.

# Import Libraries
import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date':[2.3,4.6,7.0,9.3,15.6,17.9]
})

# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1

# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()

# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']

# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)

# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)

Output

print(df)

   Temperatura  Date
0        24.72   2.3
1        25.76   4.6
3        25.31   9.3
4        26.21  15.6
5        26.59  17.9
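On whether this replicates MATLAB: the MATLAB documentation describes the 'movmedian' method of isoutlier as flagging elements that are more than three local scaled MAD (median absolute deviation) from the local median within the window, rather than using a fixed threshold. A rough pandas sketch of that rule is below. The helper name `movmedian_outliers` and its `n_mads` parameter are hypothetical. The shrinking window at the edges (`min_periods=1`) is my assumption about the endpoint handling, so results may not match MATLAB exactly on every input.

```python
import numpy as np
import pandas as pd


def movmedian_outliers(series, window=3, n_mads=3.0):
    """Flag values far from the centered rolling median (hypothetical helper)."""
    roll = series.rolling(window=window, center=True, min_periods=1)
    local_median = roll.median()
    # Median absolute deviation inside each window
    local_mad = roll.apply(
        lambda w: np.nanmedian(np.abs(w - np.nanmedian(w))), raw=True
    )
    # 1.4826 is the usual factor that scales the MAD for normally distributed data
    scaled_mad = 1.4826 * local_mad
    return (series - local_median).abs() > n_mads * scaled_mad


df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
})

mask = movmedian_outliers(df['Temperatura'], window=3)
df_clean = df[~mask]  # keep only the rows that were not flagged
print(df_clean)
```

On the sample data only the row containing 40 is flagged, which matches the behaviour described in the question.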

Nilesh's answer works perfectly. To iterate on his code, you could also do this:

upper_threshold = 1
lower_threshold = -1

# Calculate rolling median (this snippet assumes the temperature column is named 'Temp')
df['rolling_temp'] = df['Temp'].rolling(window=3).median()
# Drop the outlier rows, all in one line
df = df.drop(df[(df['Temp'] - df['rolling_temp'] > upper_threshold) | (df['Temp'] - df['rolling_temp'] < lower_threshold)].index)
# if you want to drop the column as well
del df["rolling_temp"]

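Late to the party; this is based on Nilesh Ingle's answer, modified to be more general and more verbose (graphs!), and to use a percentage threshold instead of the actual data values.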

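# Calculate rolling median
df["Temp_Rolling"] = df["Temp"].rolling(window=3).median()

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the raw temperatures, then apply the same scaling to both
# series so that their difference is a fraction of the data range
scaler = MinMaxScaler()
temp_scaled = scaler.fit_transform(df["Temp"].values.reshape(-1, 1)).ravel()
rolling_scaled = scaler.transform(df["Temp_Rolling"].values.reshape(-1, 1)).ravel()

# Calculate difference
df["Temp_Diff"] = temp_scaled - rolling_scaled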

import numpy as np
import matplotlib.pyplot as plt

# Set threshold for difference with rolling median
upper_threshold = 0.4
lower_threshold = -0.4

# Flag rows to be kept as True
df["Temp_Keep_Flag"] = np.where( (df["Temp_Diff"] > upper_threshold) | (df["Temp_Diff"] < lower_threshold), False, True)

# Keep flagged rows
print('dropped rows')
print(df[~df["Temp_Keep_Flag"]].index)
print('Your new graph')
df_result = df[df["Temp_Keep_Flag"].values]
df_result["Temp"].plot()

Once you are satisfied with the data cleaning:

# Satisfied, replace data
df = df[df["Temp_Keep_Flag"].values]
df = df.drop(columns=["Temp_Rolling", "Temp_Diff", "Temp_Keep_Flag"])
df.plot()
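The percentage threshold can also be sketched without scikit-learn by dividing the raw difference by the data range. That is what subtracting two series scaled with the same min-max parameters amounts to. This is only an illustration under assumptions: the column name "Temp", the sample values (borrowed from the question), and the names `data_range`, `relative_diff` and `keep` are all mine.

```python
import pandas as pd

df = pd.DataFrame({"Temp": [24.72, 25.76, 40, 25.31, 26.21, 26.59]})

# Trailing three-value rolling median, as in the answers above
rolling = df["Temp"].rolling(window=3).median()

# Express each difference as a fraction of the overall data range
data_range = df["Temp"].max() - df["Temp"].min()
relative_diff = (df["Temp"] - rolling) / data_range

# Keep rows within 40% of the range; the first rows have no rolling median
# yet (NaN), so the comparison is False for them and they are kept
threshold = 0.4
keep = ~(relative_diff.abs() > threshold)
df_clean = df[keep]
print(df_clean)
```

With these values only the row containing 40 exceeds the threshold, so it is the only one removed.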