Remove outliers while preserving the timestamps in dataframe

Question

I have data in a dataframe in the following format:

metric  timestamp              cas_pre        fl_rat       ...
0       2017-04-06 11:25:00    687.982849     1627.040283    ...
1       2017-04-06 11:30:00    693.427673     1506.217285    ...
2       2017-04-06 11:35:00    692.686310     1537.114807    ...
....
45      2017-04-06 11:35:00    51987.427673   1537.114807    ...
....
101003  2017-04-06 11:35:00    692.686310     1537.114807    ...

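Row 45 clearly needs to be removed because it is an outlier. There are several columns and quite a lot of rows (100,000+). I want to remove the outliers and have been using this code: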

drop_df = df.drop(columns=['timestamp'])
drop_df = drop_df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]

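However, this gives me data without the timestamps, because I cannot include the timestamp column in the z-score calculation. I want to keep the timestamps, but the association between each row and its timestamp is lost when I filter with the z-score this way. The result I am after looks like this: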

metric  timestamp              cas_pre        fl_rat       ...
0       2017-04-06 11:25:00    687.982849     1627.040283    ...
1       2017-04-06 11:30:00    693.427673     1506.217285    ...
2       2017-04-06 11:35:00    692.686310     1537.114807    ...
....
101003  2017-04-06 11:35:00    692.686310     1537.114807    ...

How can I achieve this?

Answer 1

It is probably better to explicitly set the columns to use for the z-score calculation:

import numpy as np
from scipy import stats
cols = ['cas_pre', 'fl_rat', ...]
df = df[(np.abs(stats.zscore(df[cols])) < 3).all(axis=1)]
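
As a rough, self-contained illustration of the explicit-columns approach, here is a minimal sketch. The data is synthetic and made up for the example (only the column names and the outlier value in row 45 come from the question), and passing `.to_numpy()` to `stats.zscore` is my own choice so that the mask is a plain boolean array:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the question's dataframe (values are invented).
n = 200
i = np.arange(n)
df = pd.DataFrame({
    "timestamp": pd.date_range("2017-04-06 11:25:00", periods=n, freq="5min"),
    "cas_pre": 690.0 + 3.0 * np.sin(i),
    "fl_rat": 1550.0 + 40.0 * np.cos(i),
})
df.loc[45, "cas_pre"] = 51987.427673  # the outlier from row 45 of the question

# Compute z-scores only on the chosen numeric columns ...
cols = ["cas_pre", "fl_rat"]
z = np.abs(stats.zscore(df[cols].to_numpy()))

# ... but apply the row mask to the full dataframe, so "timestamp" is kept.
mask = (z < 3).all(axis=1)
filtered = df[mask]
```

Because the mask only selects rows, every column of `df`, including `timestamp`, survives the filtering.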

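Alternatively, you can drop the timestamp column only from the input to the z-score calculation, and use the resulting boolean mask to index the original df, which still contains the timestamps: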

drop_df = df.drop(columns=['timestamp'])
df = df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]
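
A variation on the same idea, sketched under a few assumptions: instead of naming the timestamp column, non-numeric columns are excluded with `select_dtypes`, and the z-score is computed directly in pandas with `ddof=0` (which matches the default of `scipy.stats.zscore`). The helper name `drop_zscore_outliers` and the sample data are hypothetical, not part of the original answer:

```python
import pandas as pd

def drop_zscore_outliers(df, threshold=3.0):
    """Hypothetical helper: keep rows whose numeric columns all have |z| < threshold."""
    # Datetime columns such as "timestamp" are not numeric, so they are left out here.
    numeric = df.select_dtypes(include="number")
    # Population standard deviation (ddof=0), as scipy.stats.zscore uses by default.
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    # Boolean Series aligned with df's index; note a constant column yields NaN
    # z-scores, which compare as False and would drop every row.
    keep = (z.abs() < threshold).all(axis=1)
    return df[keep]

# Made-up sample data with the outlier value from the question in row 45.
df = pd.DataFrame({
    "timestamp": pd.date_range("2017-04-06 11:25:00", periods=50, freq="5min"),
    "cas_pre": [690.0 + (i % 5) for i in range(50)],
    "fl_rat": [1500.0 + 10 * (i % 7) for i in range(50)],
})
df.loc[45, "cas_pre"] = 51987.427673

clean = drop_zscore_outliers(df)
```

Since `keep` shares the index of `df`, the filtered rows keep their original index labels and timestamps.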

python

statistics

outliers

dataframe

pandas