Resample/aggregate intervals in pandas

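Given a data frame that holds the active time interval of each item, I want to count the total number of active items over time (possibly resampled).

For example, for the data frame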

import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'active': [
        pd.Interval(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-05 00:00:05')),
        pd.Interval(pd.Timestamp('2021-04-01 00:30:00'), pd.Timestamp('2021-04-01 01:30:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-02 02:00:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:05'))]})

at 2021-04-01 00:45:00 there are two active items (a and b), and at 2021-04-03 01:00:00 there is only one (a).

How can I do this?

With pandas.Interval(), you can use the in operator to check whether a pd.Timestamp() falls inside a pd.Interval(). So you can use apply() on each row to test whether the date of interest lies in the active column, then use boolean indexing to retrieve the matching rows.

date_to_check = '2021-04-01 00:45:00'
df.loc[df.apply(lambda row: pd.Timestamp(date_to_check) in row['active'] , axis=1)]

'''
  item                                      active
0    a           (2021-04-01, 2021-04-05 00:00:05]
1    b  (2021-04-01 00:30:00, 2021-04-01 01:30:00]
'''
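If only the number of active items at a single instant is needed, the same membership test can be vectorised. The following is a minimal sketch and not part of the answer above: count_active_at is a hypothetical helper name, and it assumes the active column holds right-closed pd.Interval objects as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'active': [
        pd.Interval(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-05 00:00:05')),
        pd.Interval(pd.Timestamp('2021-04-01 00:30:00'), pd.Timestamp('2021-04-01 01:30:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-02 02:00:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:05'))]})


def count_active_at(frame, when):
    """Hypothetical helper: number of rows whose interval contains `when`."""
    intervals = pd.IntervalIndex(frame['active'])
    # IntervalIndex.contains tests every interval at once and
    # returns a boolean array, so summing it yields the count.
    mask = intervals.contains(pd.Timestamp(when))
    return int(mask.sum())


print(count_active_at(df, '2021-04-01 00:45:00'))
print(count_active_at(df, '2021-04-03 01:00:00'))
```

Because the intervals are closed on the right only, a timestamp that equals an interval's left endpoint is not counted as inside it.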

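I don't think this is implemented yet, so use:

# Expand each interval into 15-minute timestamps labelled with its item.
# Note: date_range includes both endpoints, so interval closedness is ignored.
s = pd.concat([pd.Series(r.item, pd.date_range(r.active.left, r.active.right, freq='15Min'))
               for r in df.itertuples()])
print(s)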
2021-04-01 00:00:00    a
2021-04-01 00:15:00    a
2021-04-01 00:30:00    a
2021-04-01 00:45:00    a
2021-04-01 01:00:00    a
                      ..
2021-04-02 01:15:00    c
2021-04-02 01:30:00    c
2021-04-02 01:45:00    c
2021-04-02 02:00:00    c
2021-04-01 01:00:00    d
Length: 492, dtype: object

Then:

# Count how many items share each timestamp, i.e. are active at that time.
s = s.groupby(level=0).size()
print(s)
2021-04-01 00:00:00    1
2021-04-01 00:15:00    1
2021-04-01 00:30:00    2
2021-04-01 00:45:00    2
2021-04-01 01:00:00    4
                      ..
2021-04-04 23:00:00    1
2021-04-04 23:15:00    1
2021-04-04 23:30:00    1
2021-04-04 23:45:00    1
2021-04-05 00:00:00    1
Length: 385, dtype: int64
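An alternative that avoids expanding every interval onto a grid is to treat interval starts and ends as events and take a running sum. This is a sketch of a different technique, not the method used above. It assumes an item counts as active from its left endpoint up to, but not including, its right endpoint, so the numbers can differ from the output above exactly at interval endpoints. The names events, active_count, on_grid and hourly_peak are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'active': [
        pd.Interval(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-05 00:00:05')),
        pd.Interval(pd.Timestamp('2021-04-01 00:30:00'), pd.Timestamp('2021-04-01 01:30:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-02 02:00:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:05'))]})

intervals = pd.IntervalIndex(df['active'])

# +1 whenever an item becomes active, -1 whenever it stops.
events = pd.concat([
    pd.Series(1, index=intervals.left),
    pd.Series(-1, index=intervals.right),
])

# Net change per distinct timestamp (sorted by groupby), then the running
# total gives the number of active items from that moment onward.
active_count = events.groupby(level=0).sum().cumsum()

# Sample the step function on a regular 15-minute grid by carrying the
# last known count forward; points before the first event count as 0.
grid = pd.date_range('2021-04-01', '2021-04-05', freq='15min')
on_grid = active_count.reindex(grid, method='ffill').fillna(0).astype(int)

# Optional resampling step: the peak number of active items per hour.
hourly_peak = on_grid.resample('60min').max()

print(on_grid.head(8))
print(hourly_peak.head())
```

The event series has at most two entries per item, so this scales with the number of items rather than with the length of each interval divided by the grid step.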