How to trim outliers in dates in Python?

I have a data frame df:

0    2003-01-02
1    2015-10-31
2    2015-11-01
16   2015-11-02
33   2015-11-03
44   2015-11-04

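I want to trim the outliers in the dates. In this example, I want to delete the row with the date 2003-01-02. Or, in a larger data frame, I want to delete the dates that fall outside the interval containing 95% or 99% of the data. Is there a function that can do this?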

Assuming you have converted the column to datetime format:

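import pandas as pd

# `data` holds the date strings from the question
data = ["2003-01-02", "2015-10-31", "2015-11-01",
        "2015-11-02", "2015-11-03", "2015-11-04"]
df = pd.DataFrame(data)
df = pd.to_datetime(df[0])  # df is now a Series of dtype datetime64[ns]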

You can do:

include = df[df.dt.year > 2003]
print(include)

[out]:
1   2015-10-31
2   2015-11-01
3   2015-11-02
4   2015-11-03
5   2015-11-04
Name: 0, dtype: datetime64[ns]

...Regarding your answer (basically the same idea... get creative, my friend):

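# df is already a datetime Series; wrapping it in pd.Series() is harmless but not required
s = pd.Series(df)
s10 = s.quantile(.10)  # 10th percentile
s90 = s.quantile(.90)  # 90th percentile

# Keep rows whose year lies between the years of the two quantiles
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]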
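Comparing only the year is fairly coarse: any date in the same year as a quantile passes the filter. A minimal sketch of the same idea that compares full timestamps instead, assuming the six dates from the question as sample data and using Series.between (the names `dates`, `lo`, `hi` and `trimmed` are only illustrative):

```python
import pandas as pd

# Sample dates from the question
dates = pd.to_datetime(pd.Series([
    "2003-01-02", "2015-10-31", "2015-11-01",
    "2015-11-02", "2015-11-03", "2015-11-04",
]))

lo = dates.quantile(0.10)  # lower cutoff as a Timestamp
hi = dates.quantile(0.90)  # upper cutoff as a Timestamp

# between() is inclusive on both ends by default
trimmed = dates[dates.between(lo, hi)]
print(trimmed)
```

With this sample, both the 2003 date and the latest date (2015-11-04) fall outside the cutoffs, so four rows remain.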

You can use quantile() on a Series or DataFrame:

import datetime
import pandas as pd

dates = [datetime.date(2003,1,2),
         datetime.date(2015,10,31),
         datetime.date(2015,11,1),
         datetime.date(2015,11,2),
         datetime.date(2015,11,3),
         datetime.date(2015,11,4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)

qa = df['DATE'].quantile(0.1)  # 10th percentile (lower cutoff)
qb = df['DATE'].quantile(0.9)  # 90th percentile (upper cutoff)

print(qa, qb)

# Remove outliers: keep only dates between the two cutoffs (inclusive)
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)

The output is:

        DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
        DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
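
The question asks for the dates lying within a 95% or 99% interval. One way to generalize the quantile filter above is to split the excluded share evenly between the two tails. The following is only a sketch under that assumption; the helper name `trim_date_outliers` and its `coverage` parameter are hypothetical, not part of pandas:

```python
import pandas as pd

def trim_date_outliers(df, column, coverage=0.95):
    """Keep rows whose date lies in the central `coverage` share of the data.

    Hypothetical helper: the excluded share is split evenly between
    the lower and upper tails.
    """
    tail = (1.0 - coverage) / 2.0
    lower = df[column].quantile(tail)        # e.g. 2.5th percentile for 95%
    upper = df[column].quantile(1.0 - tail)  # e.g. 97.5th percentile for 95%
    return df[df[column].between(lower, upper)]

# Sample dates from the question
df = pd.DataFrame({"DATE": pd.to_datetime([
    "2003-01-02", "2015-10-31", "2015-11-01",
    "2015-11-02", "2015-11-03", "2015-11-04",
])})

trimmed_95 = trim_date_outliers(df, "DATE", coverage=0.95)
print(trimmed_95)
```

Note that on a sample this small, a quantile filter of this kind removes both the earliest and the latest date whenever they are unique, whether or not they are real outliers. It behaves more sensibly on larger data frames.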