Outlier detection based on the moving mean in Python
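I am trying to translate an algorithm from MATLAB to Python. The algorithm works on large datasets and needs an outlier detection and removal step.
In the MATLAB code, the outlier removal technique I use is movmedian: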
Outlier_T=isoutlier(Data_raw.Temperatura,'movmedian',3);
Data_raw(find(Outlier_T),:)=[]
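This detects outliers with a rolling median by finding disproportionate values at the center of a three-value moving window. So if the "Temperatura" column has a 40 in row 3, it is detected and the entire row is deleted.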
Temperatura Date
1 24.72 2.3
2 25.76 4.6
3 40 7.0
4 25.31 9.3
5 26.21 15.6
6 26.59 17.9
... ... ...
As far as I understand, this is done with pandas.DataFrame.rolling. I have seen several posts illustrating its use, but I cannot get it to work with my code:
Attempt A:
Dataframe.rolling(df["t_new"]))
Attempt B:
df-df.rolling(3).median().abs()>200
# based on @Ami Tavory's answer
Am I missing something obvious? What is the right way to do this?
Thank you for your time.
The code below removes rows based on a threshold, which can be adjusted as needed. I am not sure, though, whether it replicates the MATLAB code.
# Import Libraries
import pandas as pd
import numpy as np
# Create DataFrame
df = pd.DataFrame({
'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
'Date':[2.3,4.6,7.0,9.3,15.6,17.9]
})
# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()
# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']
# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)
# Drop flagged rows
df = df[df['drop_flag']!=1]
# Remove the helper columns
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)
Output:
print(df)
Temperatura Date
0 24.72 2.3
1 25.76 4.6
3 25.31 9.3
4 26.21 15.6
5 26.59 17.9
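On the question of whether this replicates MATLAB: as far as I know, isoutlier with 'movmedian' does not use a fixed threshold. It flags points that lie more than three local scaled MADs (median absolute deviations) from the local median, with the window centered on each point. The following is a minimal sketch of that idea, not a verified port. The function name `movmedian_outliers`, the `n_mads` parameter, and the edge handling (`min_periods=1`, so the window shrinks at the ends) are my own assumptions and may differ from MATLAB in corner cases.

```python
import numpy as np
import pandas as pd

# Scale factor that makes the MAD comparable to a standard deviation
# for normally distributed data: approx. -1 / (sqrt(2) * erfcinv(3/2)).
MAD_SCALE = 1.4826


def movmedian_outliers(series, window=3, n_mads=3.0):
    """Return a boolean mask, True where a value is a local outlier.

    Hypothetical helper: a value is flagged when it is more than
    n_mads scaled MADs away from the median of its centered window.
    """
    roll = series.rolling(window=window, center=True, min_periods=1)
    local_median = roll.median()
    # Median absolute deviation inside each window (raw=True passes ndarrays).
    local_mad = roll.apply(
        lambda w: np.median(np.abs(w - np.median(w))), raw=True
    )
    return (series - local_median).abs() > n_mads * MAD_SCALE * local_mad


df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
})

outlier_mask = movmedian_outliers(df['Temperatura'], window=3)
cleaned = df[~outlier_mask]  # drop the whole row, as in the MATLAB snippet
print(cleaned)
```

On the six sample rows this flags only the row containing 40. With a window as short as 3 the local MAD is based on very few points, so it is worth comparing a few results against MATLAB before relying on it for a large dataset.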
Nilesh's answer works perfectly. To iterate on his code, you could also do this:
upper_threshold = 1
lower_threshold = -1
# Calculate rolling median
# ('Temp' is the temperature column; it is called 'Temperatura' in the question's data)
df['rolling_temp'] = df['Temp'].rolling(window=3).median()
# all in one line
df = df.drop(df[(df['Temp']-df['rolling_temp']>upper_threshold)|(df['Temp']- df['rolling_temp']<lower_threshold)].index)
# if you want to drop the column as well
del df["rolling_temp"]
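One detail to be aware of: by default `rolling(window=3)` is right-aligned, so each median is computed from the current value and the two before it. The question describes a window centered on the value being tested. The short comparison below shows the difference on the sample data; the column labels `trailing` and `centered` are only illustrative.

```python
import pandas as pd

temps = pd.Series([24.72, 25.76, 40, 25.31, 26.21, 26.59], name='Temperatura')

# Default alignment: the window ends at the current row.
trailing = temps.rolling(window=3).median()

# center=True: the window is centered on the current row.
centered = temps.rolling(window=3, center=True).median()

comparison = pd.DataFrame({
    'value': temps,
    'trailing': trailing,
    'centered': centered,
})
print(comparison)
```

The trailing version leaves the first two rows as NaN, while the centered version leaves the first and last rows as NaN. Passing `min_periods=1` fills those edges using shorter windows, if that is acceptable for your data.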
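Late to the party; this builds on Nilesh Ingle's answer. I modified it to be more general and more verbose (graphs!), and to use a percentage threshold instead of the actual data values.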
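import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Scale 'Temp' to [0, 1] so the thresholds below are fractions of the data range
# (assumption: df_scaled holds the min-max scaled 'Temp' column)
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[["Temp"]]), columns=["Temp"], index=df.index)
# Calculate rolling median of the scaled values
df["Temp_Rolling"] = df_scaled["Temp"].rolling(window=3).median()
# Calculate difference
df["Temp_Diff"] = df_scaled["Temp"] - df["Temp_Rolling"]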
import numpy as np
import matplotlib.pyplot as plt
# Set threshold for difference with rolling median
upper_threshold = 0.4
lower_threshold = -0.4
# Flag rows to keep as True
df["Temp_Keep_Flag"] = np.where( (df["Temp_Diff"] > upper_threshold) | (df["Temp_Diff"] < lower_threshold), False, True)
# Keep flagged rows
print('dropped rows')
print(df[~df["Temp_Keep_Flag"]].index)
print('Your new graph')
df_result = df[df["Temp_Keep_Flag"].values]
df_result["Temp"].plot()
Once you are satisfied with the data cleaning:
# Satisfied, replace data
# Filter and drop the helper columns in one step (avoids modifying a slice in place)
df = df[df["Temp_Keep_Flag"]].drop(columns=["Temp_Rolling", "Temp_Diff", "Temp_Keep_Flag"])
df.plot()
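If "percentage threshold" is read as a deviation relative to the local median rather than to the overall data range, scaling can be skipped altogether. The sketch below is a different technique from the MinMaxScaler approach above, shown only as an alternative. The function name `drop_relative_outliers`, the parameter `max_rel_dev`, and the 20% default are hypothetical choices for illustration, and it assumes the rolling median is never zero.

```python
import pandas as pd


def drop_relative_outliers(df, column, window=3, max_rel_dev=0.2):
    """Keep rows whose value is within max_rel_dev (a fraction, e.g. 0.2 = 20%)
    of the centered rolling median of `column`. Hypothetical helper."""
    rolling_median = (
        df[column].rolling(window=window, center=True, min_periods=1).median()
    )
    # Relative deviation from the local median; assumes the median is non-zero.
    rel_dev = (df[column] - rolling_median).abs() / rolling_median.abs()
    return df[rel_dev <= max_rel_dev]


df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9],
})

result = drop_relative_outliers(df, 'Temperatura', window=3, max_rel_dev=0.2)
print(result)
```

A relative threshold like this suits quantities measured on a ratio scale. For temperatures in degrees Celsius near zero it can behave badly, in which case an absolute or MAD-based threshold is probably the safer choice.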