Resample/aggregate intervals in pandas
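Given a DataFrame that holds an active time interval for each item, I want to count the total number of active items over time (possibly resampled).
For example, for the DataFrame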
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'active': [
        pd.Interval(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-05 00:00:05')),
        pd.Interval(pd.Timestamp('2021-04-01 00:30:00'), pd.Timestamp('2021-04-01 01:30:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-02 02:00:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:05'))]})
at 2021-04-01 00:45:00 there are two active items (a and b), while at 2021-04-03 01:00:00 there is only one (a).
How can I do this?
With pandas.Interval, you can use the in operator to check whether a pd.Timestamp falls inside a pd.Interval. So you can use apply() on each row to check whether the date of interest lies in the interval stored in the active column, then use boolean indexing to retrieve the matching rows.
date_to_check = '2021-04-01 00:45:00'
df.loc[df.apply(lambda row: pd.Timestamp(date_to_check) in row['active'] , axis=1)]
'''
item active
0 a (2021-04-01, 2021-04-05 00:00:05]
1 b (2021-04-01 00:30:00, 2021-04-01 01:30:00]
'''
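As a possible vectorized alternative to the row-wise apply() above, the intervals can be wrapped in a pd.IntervalIndex, whose contains() method tests a scalar against every interval at once. This is only a sketch: the helper name active_at is made up for illustration, and the intervals keep the default closed='right', so a timestamp equal to an interval's left endpoint does not count as active.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'active': [
        pd.Interval(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-05 00:00:05')),
        pd.Interval(pd.Timestamp('2021-04-01 00:30:00'), pd.Timestamp('2021-04-01 01:30:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-02 02:00:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:05'))]})

# Build the index once so repeated lookups do not re-scan Python objects.
intervals = pd.IntervalIndex(df['active'])

def active_at(ts):
    """Return the items whose interval contains ts (hypothetical helper)."""
    mask = intervals.contains(pd.Timestamp(ts))  # boolean NumPy array, one entry per row
    return df.loc[mask, 'item'].tolist()

print(active_at('2021-04-01 00:45:00'))  # ['a', 'b']
print(active_at('2021-04-03 01:00:00'))  # ['a']
```

The number of active items at a given moment is then just len(active_at(ts)).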
I don't think this is implemented in pandas yet, so use:
s = pd.concat([pd.Series(r.item, pd.date_range(r.active.left, r.active.right, freq='15Min'))
               for r in df.itertuples()])
print(s)
2021-04-01 00:00:00 a
2021-04-01 00:15:00 a
2021-04-01 00:30:00 a
2021-04-01 00:45:00 a
2021-04-01 01:00:00 a
..
2021-04-02 01:15:00 c
2021-04-02 01:30:00 c
2021-04-02 01:45:00 c
2021-04-02 02:00:00 c
2021-04-01 01:00:00 d
Length: 492, dtype: object
Then count the items at each timestamp:
s = s.groupby(level=0).size()
print (s)
2021-04-01 00:00:00 1
2021-04-01 00:15:00 1
2021-04-01 00:30:00 2
2021-04-01 00:45:00 2
2021-04-01 01:00:00 4
..
2021-04-04 23:00:00 1
2021-04-04 23:15:00 1
2021-04-04 23:30:00 1
2021-04-04 23:45:00 1
2021-04-05 00:00:00 1
Length: 385, dtype: int64
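For comparison, here is a sketch that counts active items on a fixed grid with NumPy broadcasting instead of expanding every interval into its own date_range. The names grid, counts, and hourly, the grid bounds, and the choice of max as the hourly aggregate are all assumptions made for illustration. It also applies the intervals' right-closed semantics strictly (left < t <= right), so its counts differ from the output above wherever a grid point coincides with a left endpoint, for example at 2021-04-01 01:00:00.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'active': [
        pd.Interval(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-05 00:00:05')),
        pd.Interval(pd.Timestamp('2021-04-01 00:30:00'), pd.Timestamp('2021-04-01 01:30:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-02 02:00:00')),
        pd.Interval(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:05'))]})

# Assumed sampling grid: every 15 minutes across the span of the example data.
grid = pd.date_range('2021-04-01 00:00:00', '2021-04-05 00:00:00', freq='15min')

intervals = pd.IntervalIndex(df['active'])
left = intervals.left.to_numpy()    # shape (n_items,)
right = intervals.right.to_numpy()  # shape (n_items,)
t = grid.to_numpy()[:, None]        # shape (n_points, 1) for broadcasting

# Right-closed intervals: an item is active at t when left < t <= right.
is_active = (t > left) & (t <= right)  # shape (n_points, n_items)
counts = pd.Series(is_active.sum(axis=1), index=grid)

# Optional resampling: the peak number of simultaneously active items per hour.
hourly = counts.resample('60min').max()

print(counts.loc['2021-04-01 00:45:00'])  # 2
print(counts.loc['2021-04-03 01:00:00'])  # 1
print(hourly.head())
```

Any grid-based count only sees what happens exactly at the sample points. Item d is active for just five seconds and never shows up on a 15-minute grid here, so a finer grid or an event-based count may be needed if such short intervals matter.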