Filtering out outliers in a Pandas DataFrame with a rolling median
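I am trying to filter out some outliers from a scatter plot of GPS elevation displacements against dates.
I am trying to use df.rolling to compute the median and standard deviation for each window, and then remove a point if it lies more than 3 standard deviations from the median.
However, I can't figure out a way to loop through the column and compare each value to the rolling median.
Here is the code I have so far: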
import pandas as pd
import numpy as np

def median_filter(df, window):
    cnt = 0
    median = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    for row in df.b:
        # compare each value to its median (this is the part I am stuck on)
        pass

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=['a', 'b'])
median_filter(df, 10)
How can I loop through, compare each point, and remove the outliers?
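There may well be a more pandastic way to do this. This is a bit of a hack that relies on manually mapping the original df's index to each rolling window. (I picked a window size of 6.) The records up to and including row 6 are associated with the first window; row 7 belongs to the second window, and so on.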
import numpy as np
import pandas as pd

n = 100
df = pd.DataFrame(np.random.randint(0, n, size=(n, 2)), columns=['a', 'b'])

## set window size
window = 6
std = 1  # I set it at just 1; with real data and larger windows, can be larger

## create df with rolling stats, upper and lower bounds
bounds = pd.DataFrame({'median': df['b'].rolling(window).median(),
                       'std': df['b'].rolling(window).std()})
bounds['upper'] = bounds['median'] + bounds['std'] * std
bounds['lower'] = bounds['median'] - bounds['std'] * std

## here, we set an identifier for each window which maps to the original df
## the first six rows are the first window; then each additional row is a new window
bounds['window_id'] = np.append(np.zeros(window), np.arange(1, n - window + 1))

## then we can assign the original 'b' value back to the bounds df
bounds['b'] = df['b']

## and finally, keep only rows where b falls within the desired bounds
bounds.loc[bounds.eval("lower<b<upper")]
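On the comment about the multiplier: with a trailing window that includes the point being tested, a lone spike inflates that window's own standard deviation. For a single spike in otherwise flat data, the spike sits roughly sqrt(window) sample standard deviations from the median, so a threshold of 3 cannot catch it unless the window holds more than 9 points. Here is a rough sketch of that effect; `spike_is_flagged` and its parameters are hypothetical names made up for this illustration, and the flat test data is an assumption, not real GPS data.

```python
import numpy as np
import pandas as pd

def spike_is_flagged(window, num_std, n=60, spike_at=40, spike=500.0):
    """Hypothetical helper: is a single spike in flat data flagged as an outlier?"""
    s = pd.Series(np.zeros(n))
    s.iloc[spike_at] = spike
    rolling = s.rolling(window)
    # Flag points farther than num_std rolling standard deviations
    # from the rolling median of the same trailing window.
    flagged = (s - rolling.median()).abs() > num_std * rolling.std()
    return bool(flagged.iloc[spike_at])

print(spike_is_flagged(window=6, num_std=3))   # False: 3 * 500 / sqrt(6) exceeds 500
print(spike_is_flagged(window=6, num_std=1))   # True
print(spike_is_flagged(window=20, num_std=3))  # True: 3 * 500 / sqrt(20) is below 500
```

So when the window is small, either lower the multiplier (as above) or widen the window before relying on a 3 standard deviation rule.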
Just filter the DataFrame directly:
df['median']= df['b'].rolling(window).median()
df['std'] = df['b'].rolling(window).std()
#filter setup
df = df[(df.b <= df['median']+3*df['std']) & (df.b >= df['median']-3*df['std'])]
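If you would rather not add helper columns to df, the same comparison can be wrapped in a function. This is only a minimal sketch: `rolling_median_filter` and its parameter names are hypothetical, and the sine-plus-spike data is made up for the example. Note that the first `window - 1` rows have NaN rolling statistics, so the comparison is False there and those rows are dropped as well; passing a smaller `min_periods` changes that.

```python
import numpy as np
import pandas as pd

def rolling_median_filter(df, column, window, num_std=3.0, min_periods=None):
    """Keep rows whose `column` value lies within num_std rolling standard
    deviations of the rolling median (hypothetical helper)."""
    rolling = df[column].rolling(window, min_periods=min_periods)
    median = rolling.median()
    std = rolling.std()
    # between() is inclusive on both ends; rows with NaN bounds compare as False.
    keep = df[column].between(median - num_std * std, median + num_std * std)
    return df[keep]

# Made-up data: a smooth signal with one obvious spike at row 30.
df = pd.DataFrame({"a": np.arange(50), "b": 100.0 + np.sin(np.arange(50))})
df.loc[30, "b"] = 500.0

filtered = rolling_median_filter(df, "b", window=20)
print(30 in filtered.index)  # False: the spike row has been removed
```

The returned frame keeps the original index, so it is easy to see which rows were removed.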
Here is my take on creating a median filter:
def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-1]
        return s if s >= _median - num_std * _std and s <= _median + num_std * _std else np.nan
    return _median_filter

df.y.rolling(window).apply(median_filter(num_std=3), raw=True)
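The apply call returns a Series with NaN wherever a point was rejected (and for the first `window - 1` rows, where the window is not yet full), so actually deleting the points takes one more step. A possible usage sketch follows, assuming a column named `y`; the `y_filtered` column name and the test data are made up here. One thing to be aware of: `np.std` defaults to the population standard deviation (ddof=0), while `rolling(...).std()` defaults to the sample standard deviation (ddof=1), so this filter is slightly stricter than the column-based versions for the same `num_std`.

```python
import numpy as np
import pandas as pd

def median_filter(num_std=3):
    def _median_filter(x):
        median = np.median(x)
        std = np.std(x)  # population std (ddof=0), unlike rolling().std()
        s = x[-1]        # the newest point in the trailing window
        return s if abs(s - median) <= num_std * std else np.nan
    return _median_filter

# Made-up data: a smooth signal with one obvious spike at row 30.
df = pd.DataFrame({"y": 100.0 + np.sin(np.arange(50))})
df.loc[30, "y"] = 500.0

window = 20
df["y_filtered"] = df["y"].rolling(window).apply(median_filter(num_std=3), raw=True)

# Drop rejected points and the rows where the window was not yet full.
cleaned = df.dropna(subset=["y_filtered"])
print(30 in cleaned.index)  # False: the spike row has been removed
```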