Python: remove subgroups from a groupby object based on a condition

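I have a time-series DataFrame containing multiple groups organized in a 3-level hierarchy (i.e. 3 id columns), plus a date column and a value column. I grouped them with the code below; a sample of the result looks like this: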

grp = df.groupby(['level0','level1','level2','date'])

                                        values          
level0  level1  level2      date        
A       AA      AA_1        2006-10-31  300
                            2006-11-30  220
                            2006-12-31  415
                            2007-04-30  19
                            2007-05-31  77
                            2007-08-31  463
                AA_2        2006-10-31  6630
                            2006-11-30  1980
                            2006-12-31  3367
                            2007-04-30  199
        AB      AB_1        2006-01-31  693
                            2006-05-31  2694
                            2007-09-30  6681 
...     ...     ...         ...         ...
Z       ZZ      ZZ_9        2006-04-30  3680
                            2006-09-30  277
                            2007-03-31  1490
                            2007-09-30  289
                            2007-10-31  387

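I want to remove the level2 subgroups that have no records in the last 6 months. Suppose the maximum date for group A is 2007-12-31; then I want to remove AA_2, because it has no records in the last 6 months. The desired output would look like this: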

                                        values
level0  level1  level2      date        
A       AA      AA_1        2006-10-31  300
                            2006-11-30  220
                            2006-12-31  415
                            2007-04-30  19
                            2007-05-31  77
                            2007-08-31  463
        AB      AB_1        2006-01-31  693
                            2006-05-31  2694
                            2007-09-30  6681 
...     ...     ...         ...         ...
Z       ZZ      ZZ_9        2006-04-30  3680
                            2006-09-30  277
                            2007-03-31  1490
                            2007-09-30  289
                            2007-10-31  387

I can extract the date range with the following code:

from dateutil.relativedelta import relativedelta
import pandas as pd

end_date = df.date.max()
start_date = end_date - relativedelta(months=+6 - 1)
test_period = pd.date_range(start=start_date, end=end_date, freq='1M').to_list()

[Timestamp('2007-07-31 00:00:00', freq='M'),
 Timestamp('2007-08-31 00:00:00', freq='M'),
 Timestamp('2007-09-30 00:00:00', freq='M'),
 Timestamp('2007-10-31 00:00:00', freq='M'),
 Timestamp('2007-11-30 00:00:00', freq='M'),
 Timestamp('2007-12-31 00:00:00', freq='M')]

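However, each level0 group has a different maximum date (for example, some collected data up to 2007-12-31 and others only up to 2007-11-30), so the code above, which takes the maximum date of the whole dataset, is incorrect for some groups.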

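My question is: how can I find the maximum date within each level0 group and remove the level2 subgroups that have no records at all in the 6 months before it?

Thanks in advance! (Any solution is welcome, though a fast one is most needed!)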

First, create a helper DataFrame with MultiIndex.to_frame, using DataFrame.add_suffix to rename the columns so that they do not clash with the index level names. Then get the maximum date per first level (level0), subtract 6 months, and compare each date against that cutoff to build the boolean Series s. Finally, test whether each group of the first 3 levels contains at least one True, using GroupBy.transform with GroupBy.any, and filter with boolean indexing:

print (df)
                                 values
level0 level1 level2 date              
A      AA     AA_1   2007-12-31     300 <- date change
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
              AA_2   2006-10-31    6630
                     2006-11-30    1980
                     2006-12-31    3367
                     2007-04-30     199
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681

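# Index levels as columns, suffixed with '_'
df1 = df.index.to_frame().add_suffix('_')

# True where a row's date is later than (max date of its level0 group - 6 months)
s = df1['date_'].gt(df1.groupby('level0')['date_']
                      .transform('max')
                      .sub(pd.offsets.DateOffset(months=6)))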
print (s)
level0  level1  level2  date      
A       AA      AA_1    2007-12-31     True
                        2006-11-30    False
                        2006-12-31    False
                        2007-04-30    False
                        2007-05-31    False
                        2007-08-31     True
                AA_2    2006-10-31    False
                        2006-11-30    False
                        2006-12-31    False
                        2007-04-30    False
        AB      AB_1    2006-01-31    False
                        2006-05-31    False
                        2007-09-30     True
Name: date_, dtype: bool
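
To see why only 2007-12-31, 2007-08-31 and 2007-09-30 come out as True, it may help to look at the cutoff on its own. This is a small sketch for group A only, using its maximum date from the sample above; the variable names are just for illustration. Note that the comparison is strict, so a record dated exactly on the cutoff would not count as recent.

```python
import pandas as pd

# Maximum date of level0 group A in the sample above
end = pd.Timestamp('2007-12-31')

# Calendar-aware subtraction: June has no 31st, so the day is clipped to the 30th
cutoff = end - pd.DateOffset(months=6)
print(cutoff)

# .gt() is a strict "greater than" comparison
is_recent_aug = pd.Timestamp('2007-08-31') > cutoff   # inside the window
is_recent_may = pd.Timestamp('2007-05-31') > cutoff   # outside the window
is_recent_edge = pd.Timestamp('2007-06-30') > cutoff  # exactly on the cutoff
print(is_recent_aug, is_recent_may, is_recent_edge)
```

If a record falling exactly on the cutoff should count as recent, `ge` could be used in place of `gt`.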

# Keep only rows whose (level0, level1, level2) subgroup has at least one recent date
df = df[s.groupby(['level0','level1','level2']).transform('any')]

print (df)
                                 values
level0 level1 level2 date              
A      AA     AA_1   2007-12-31     300
                     2006-11-30     220
                     2006-12-31     415
                     2007-04-30      19
                     2007-05-31      77
                     2007-08-31     463
       AB     AB_1   2006-01-31     693
                     2006-05-31    2694
                     2007-09-30    6681
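
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea wrapped in a function. The function name `drop_stale_subgroups` and the rows for level0 group B are hypothetical, added only to show a group whose maximum date differs from A's; the other rows follow the sample above. It builds the date Series directly from the index level instead of going through a helper DataFrame, which is a small variation on the steps above, not a different method.

```python
import pandas as pd

def drop_stale_subgroups(df, months=6):
    """Keep only (level0, level1, level2) subgroups that have at least one
    record later than (max date of their level0 group - `months` months)."""
    # Dates as a Series aligned with the rows of df
    dates = df.index.get_level_values('date').to_series(index=df.index)
    # Per-level0 maximum date, broadcast to every row, minus the window
    cutoff = (dates.groupby(level='level0').transform('max')
              - pd.DateOffset(months=months))
    recent = dates.gt(cutoff)
    # A subgroup survives if any of its rows is recent
    keep = recent.groupby(level=['level0', 'level1', 'level2']).transform('any')
    return df[keep.to_numpy()]

rows = [
    ('A', 'AA', 'AA_1', '2007-12-31', 300),
    ('A', 'AA', 'AA_1', '2006-11-30', 220),
    ('A', 'AA', 'AA_1', '2006-12-31', 415),
    ('A', 'AA', 'AA_1', '2007-04-30', 19),
    ('A', 'AA', 'AA_1', '2007-05-31', 77),
    ('A', 'AA', 'AA_1', '2007-08-31', 463),
    ('A', 'AA', 'AA_2', '2006-10-31', 6630),
    ('A', 'AA', 'AA_2', '2006-11-30', 1980),
    ('A', 'AA', 'AA_2', '2006-12-31', 3367),
    ('A', 'AA', 'AA_2', '2007-04-30', 199),
    ('A', 'AB', 'AB_1', '2006-01-31', 693),
    ('A', 'AB', 'AB_1', '2006-05-31', 2694),
    ('A', 'AB', 'AB_1', '2007-09-30', 6681),
    # Hypothetical group B: its data ends on 2007-06-30, not 2007-12-31
    ('B', 'BA', 'BA_1', '2007-03-31', 50),
    ('B', 'BA', 'BA_2', '2007-06-30', 70),
    ('B', 'BA', 'BA_3', '2006-10-31', 5),
]
df = pd.DataFrame(rows, columns=['level0', 'level1', 'level2', 'date', 'values'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['level0', 'level1', 'level2', 'date'])

result = drop_stale_subgroups(df)
print(result)
```

With a single dataset-wide maximum date, BA_1 would have been dropped, since its last record (2007-03-31) is more than 6 months before 2007-12-31. Measured against group B's own maximum date it is kept, while AA_2 and BA_3 are removed.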