Python: remove subgroups in a groupby object based on a condition
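I have a time-series DataFrame that contains multiple groups in a 3-level hierarchy (i.e. 3 id columns), plus a date column and a value column. I wrote code to group them; a sample of the result looks like this: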
grp = df.groupby(['level0','level1','level2','date'])
                                 values
level0 level1 level2 date
A      AA     AA_1   2006-10-31     300
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
              AA_2   2006-10-31    6630
                     2006-11-30    1980
                     2006-12-31    3367
                     2007-04-30     199
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681
...    ...    ...    ...            ...
Z      ZZ     ZZ_9   2006-04-30    3680
                     2006-09-30     277
                     2007-03-31    1490
                     2007-09-30     289
                     2007-10-31     387
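I want to drop the level2 subgroups that have no records in the last 6 months. For example, if the maximum date for group A is 2007-12-31, I want to drop AA_2 because it has no records in the last 6 months. The desired output looks like this: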
                                 values
level0 level1 level2 date
A      AA     AA_1   2006-10-31     300
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681
...    ...    ...    ...            ...
Z      ZZ     ZZ_9   2006-04-30    3680
                     2006-09-30     277
                     2007-03-31    1490
                     2007-09-30     289
                     2007-10-31     387
I can extract the date range with the following code:
from dateutil.relativedelta import relativedelta
import pandas as pd
end_date = df.date.max()
start_date = end_date - relativedelta(months=+6 - 1)
test_period = pd.date_range(start=start_date, end=end_date, freq='1M').to_list()
[Timestamp('2007-07-31 00:00:00', freq='M'),
Timestamp('2007-08-31 00:00:00', freq='M'),
Timestamp('2007-09-30 00:00:00', freq='M'),
Timestamp('2007-10-31 00:00:00', freq='M'),
Timestamp('2007-11-30 00:00:00', freq='M'),
Timestamp('2007-12-31 00:00:00', freq='M')]
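However, each level0 group has a different maximum date (for example, some collect data up to 2007-12-31 and others only up to 2007-11-30), so the code above, which uses the maximum date of the whole dataset, is wrong for some groups.
My question is: how can I find the maximum date within each level0 group and drop the level2 subgroups that have no records at all in the last 6 months?
Thanks in advance! (Any solution is welcome, though a fast one is most needed.)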
First create a helper DataFrame from the index with MultiIndex.to_frame, using DataFrame.add_suffix to rename the columns so they do not clash with the index level names. Then get the maximum date per level0 group, subtract 6 months, and compare each row's date against that cutoff to get the boolean Series s. Finally, test whether each group of the first 3 levels has at least one True by using GroupBy.transform with GroupBy.any, and filter by boolean indexing.
Sample data (with one date changed, as marked) and the helper DataFrame:
print (df)
                                 values
level0 level1 level2 date
A      AA     AA_1   2007-12-31     300   <- date change
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
              AA_2   2006-10-31    6630
                     2006-11-30    1980
                     2006-12-31    3367
                     2007-04-30     199
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681
df1 = df.index.to_frame().add_suffix('_')
s = df1['date_'].gt(df1.groupby('level0')['date_']
                       .transform('max')
                       .sub(pd.offsets.DateOffset(months=6)))
print (s)
level0 level1 level2 date
A      AA     AA_1   2007-12-31    True
                     2006-11-30   False
                     2006-12-31   False
                     2007-04-30   False
                     2007-05-31   False
                     2007-08-31    True
              AA_2   2006-10-31   False
                     2006-11-30   False
                     2006-12-31   False
                     2007-04-30   False
       AB     AB_1   2006-01-31   False
                     2006-05-31   False
                     2007-09-30    True
Name: date_, dtype: bool
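To see where the True/False values in s come from, it can help to compute the cutoff by hand. The sketch below only reproduces the date arithmetic; the group ending on 2007-11-30 is a hypothetical case taken from the question's description, and the month-end snapping at the end is an optional variation, not part of the answer above.

```python
import pandas as pd

six_months = pd.offsets.DateOffset(months=6)

# Group A's maximum date in the sample above. Going back six calendar
# months lands in June, so the day is clipped from 31 to 30.
cutoff_dec = pd.Timestamp("2007-12-31") - six_months

# Hypothetical group whose records end one month earlier. The cutoff is
# now 2007-05-30, one day BEFORE the May month end, so a record stamped
# 2007-05-31 would still count as "recent" under a strict > comparison.
cutoff_nov = pd.Timestamp("2007-11-30") - six_months

# Optional variation (an assumption, only sensible for month-end data):
# MonthEnd(0) rolls a date forward to its month end and leaves dates
# that already sit on a month end unchanged.
cutoff_nov_snapped = cutoff_nov + pd.offsets.MonthEnd(0)

print(cutoff_dec.date(), cutoff_nov.date(), cutoff_nov_snapped.date())
```

If the dates are always month ends and the window should cover exactly six month-end stamps, snapping the cutoff this way before calling gt is one possible adjustment.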
df = df[s.groupby(['level0','level1','level2']).transform('any')]
print (df)
                                 values
level0 level1 level2 date
A      AA     AA_1   2007-12-31     300
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681
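For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The rows are illustrative only (loosely based on the sample above, plus a made-up group B whose data end a month earlier), and drop_stale_subgroups is a hypothetical helper name, not a pandas function. It works on the index levels directly instead of building the suffixed helper frame.

```python
import pandas as pd

# Illustrative rows only (assumed data): group A ends on 2007-12-31,
# the hypothetical group B ends on 2007-11-30.
rows = [
    ("A", "AA", "AA_1", "2006-11-30", 220),
    ("A", "AA", "AA_1", "2007-08-31", 463),
    ("A", "AA", "AA_1", "2007-12-31", 300),
    ("A", "AA", "AA_2", "2006-10-31", 6630),
    ("A", "AA", "AA_2", "2007-04-30", 199),
    ("A", "AB", "AB_1", "2006-01-31", 693),
    ("A", "AB", "AB_1", "2007-09-30", 6681),
    ("B", "BA", "BA_1", "2007-11-30", 10),
    ("B", "BA", "BA_2", "2007-03-31", 20),
    ("B", "BA", "BA_3", "2007-06-30", 30),
]
df = pd.DataFrame(rows, columns=["level0", "level1", "level2", "date", "values"])
df["date"] = pd.to_datetime(df["date"])
df = df.set_index(["level0", "level1", "level2", "date"])


def drop_stale_subgroups(frame, months=6):
    """Hypothetical helper: keep level2 subgroups that have at least one
    record after (max date of their level0 group) minus `months` months."""
    # Dates as a Series that shares the frame's MultiIndex.
    dates = frame.index.get_level_values("date").to_series(index=frame.index)
    # Per-level0 maximum date, broadcast back to every row, minus the window.
    cutoff = dates.groupby(level="level0").transform("max") - pd.offsets.DateOffset(months=months)
    recent = dates.gt(cutoff)
    # A subgroup survives if any of its rows is recent.
    keep = recent.groupby(level=["level0", "level1", "level2"]).transform("any")
    return frame[keep.to_numpy()]


result = drop_stale_subgroups(df)
kept = sorted(set(result.index.get_level_values("level2")))
print(kept)  # ['AA_1', 'AB_1', 'BA_1', 'BA_3']
```

Note that BA_3 is kept only because the cutoff is computed per level0 group: with the dataset-wide maximum of 2007-12-31 its last record on 2007-06-30 would not be later than the cutoff and the subgroup would be dropped.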