Pandas - Cross referencing with DatetimeIndex - Groupby

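I have month-end data for many companies. I want to use groupby to create a new column for each company, following the pattern in the expected output below.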

You can download the sample data here: https://www.dropbox.com/s/oz1ltblh6u0chzt/tem_20220506.csv?dl=0

I have been trying the following code, without success:

import pandas as pd

df = pd.read_csv('tem_20220506.csv', parse_dates=['date'])
df.drop(columns=['new_col', 'Note'], inplace=True)
df = df.set_index('date').rename_axis(None)

df['new_col'] = df.groupby('comp').apply(lambda g: --- ) # ← I am now stuck here

Expected output:

           comp  old_col  new_col
2000-01-31    a        1      NaN   
2000-02-29    a        2      NaN   
2000-03-31    a        3      NaN   
2000-04-30    a        4      NaN   
2000-05-31    a        5      NaN   
2000-06-30    a        6      NaN   
2000-07-31    a        7      NaN   
2000-08-31    a        8      NaN   
2000-09-30    a        9      NaN   
2000-10-31    a       10      NaN   
2000-11-30    a       11      NaN   
2000-12-31    a       12      NaN   
2001-01-31    a       13      NaN   
2001-02-28    a       14      NaN   
2001-03-31    a       15      NaN   
2001-04-30    a       16      NaN   
2001-05-31    a       17      NaN   
2001-06-30    a       18      NaN   
2001-07-31    a       19   12.000   
2001-08-31    a       20   12.000   
2001-09-30    a       21   12.000   
2001-10-31    a       22   12.000   
2001-11-30    a       23   12.000   
2001-12-31    a       24   12.000   
2002-01-31    a       25   12.000   
2002-02-28    a       26   12.000   
2002-03-31    a       27   12.000   
2002-04-30    a       28   12.000   
2002-05-31    a       29   12.000   
2002-06-30    a       30   12.000   
2002-07-31    a       31   24.000   
2002-08-31    a       32   24.000   
2002-09-30    a       33   24.000   
2002-10-31    a       34   24.000   
2002-11-30    a       35   24.000   
2002-12-31    a       36   24.000   
2000-01-31    b      101      NaN   
2000-02-29    b      102      NaN   
2000-03-31    b      103      NaN   
2000-04-30    b      104      NaN   
2000-05-31    b      105      NaN   
2000-06-30    b      106      NaN   
2000-07-31    b      107      NaN   
2000-08-31    b      108      NaN   
2000-09-30    b      109      NaN   
2000-10-31    b      110      NaN   
2000-11-30    b      111      NaN   
2001-01-31    b      113      NaN   
2001-02-28    b      114      NaN   
2001-03-31    b      115      NaN   
2001-04-30    b      116      NaN   
2001-05-31    b      117      NaN   
2001-06-30    b      118      NaN   
2001-07-31    b      119      NaN   
2001-08-31    b      120      NaN   
2001-09-30    b      121      NaN   
2001-10-31    b      122      NaN   
2001-11-30    b      123      NaN   
2001-12-31    b      124      NaN   
2002-01-31    b      125      NaN   
2002-02-28    b      126      NaN   
2002-03-31    b      127      NaN   
2002-04-30    b      128      NaN   
2002-05-31    b      129      NaN   
2002-06-30    b      130      NaN   
2002-07-31    b      131  124.000   
2002-08-31    b      132  124.000   
2002-10-31    b      134  124.000   
2002-11-30    b      135  124.000   
2002-12-31    b      136  124.000   

(!!) Note for comp==b: the rows for 2000-12-31 and 2002-09-30 are missing, and new_col stays NaN through 2002-06-30.

df = pd.read_csv('tem_20220506.csv', parse_dates=['date'])
df.drop(columns=['new_col', 'Note'], inplace=True)
df.set_index('date', inplace=True)

Use a helper function to derive the new column from the old column:

def helper_func(x):
    # work on a copy so the group handed over by groupby is not modified in place
    x = x.copy()

    # map each December date of this company to its old_col value
    req_values = x.loc[x.index.month == 12, 'old_col'].to_dict()

    # fill July of the following year through June of the year after with that December value
    for date_value, old_col_value in req_values.items():
        x.loc[f'{date_value.year+1}-07-31':f'{date_value.year+2}-06-30', 'new_col'] = old_col_value

    return x

# group_keys=False keeps the plain date index (needed on pandas 2.0+) so the result lines up with df
df['new_col'] = df.groupby('comp', group_keys=False)[['old_col']].apply(helper_func)['new_col']

This produces the DataFrame shown in your expected output.
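For readers without the CSV, here is a self-contained sketch of the helper approach. The frame is rebuilt from the expected-output table in the question, which is an assumption about what the file contains, and `month_ends` is a hypothetical convenience function, not part of pandas. `group_keys=False` is passed so that on pandas 2.0 and later the applied result keeps the plain date index and can be assigned back.

```python
import pandas as pd


def month_ends(start, end):
    """Hypothetical helper: month-end dates between two month-start strings."""
    return pd.date_range(start, end, freq="MS") + pd.offsets.MonthEnd(0)


# Rebuild both companies from the expected-output table (assumed data).
dates = month_ends("2000-01-01", "2002-12-01")
a = pd.DataFrame({"comp": "a", "old_col": range(1, 37)}, index=dates)
b = pd.DataFrame({"comp": "b", "old_col": range(101, 137)}, index=dates)
# Company b has no rows for December 2000 and September 2002.
b = b.drop(pd.to_datetime(["2000-12-31", "2002-09-30"]))
df = pd.concat([a, b])


def helper_func(x):
    x = x.copy()
    x["new_col"] = float("nan")
    # December values of this company, keyed by their date.
    december = x.loc[x.index.month == 12, "old_col"]
    for date_value, old_col_value in december.items():
        start = f"{date_value.year + 1}-07-31"
        end = f"{date_value.year + 2}-06-30"
        # July of the next year through June of the year after.
        x.loc[start:end, "new_col"] = old_col_value
    return x


result = df.groupby("comp", group_keys=False)[["old_col"]].apply(helper_func)
df["new_col"] = result["new_col"]

# Company b only ever receives 124.0, its December 2001 value.
print(df.loc[df["comp"].eq("b"), "new_col"].dropna().unique())
```

The assignment back to `df` relies on the rows already being ordered by company and then by date, as they are in the sample; with a different row order the duplicated dates would make the alignment ambiguous.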

Alternative helper function:

def helper_fun2(x):
    """
    1. Iterate over the years present in the index.
    2. If the year has a December 31 row, copy its old_col value into
       new_col for July of the next year through June of the year after.
    """
    for year in x.index.year.unique():
        if f'{year}-12-31' in x.index:
            x.loc[f'{year+1}-07-31':f'{year+2}-06-30', 'new_col'] = x.loc[f'{year}-12-31', 'old_col']

    return x
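Both helpers loop over years inside `apply`. As a possible alternative, not taken from the answer above, the same lookup can be sketched as a left merge: each row computes which December it should read from, and the December rows are joined on company and year. The function name `add_new_col_by_merge` and the `src_year` column are made up for this illustration, and the sketch assumes at most one December row per company per year.

```python
import pandas as pd


def add_new_col_by_merge(df):
    """Return a copy of df with new_col filled by a left merge (no apply)."""
    rows = df.rename_axis("date").reset_index()
    year = rows["date"].dt.year.astype("int64")
    month = rows["date"].dt.month

    # July..December of year Y+1 and January..June of year Y+2
    # both read the December value of year Y.
    rows["src_year"] = year - 1 - (month <= 6).astype("int64")

    # One lookup row per (company, December year); assumed unique.
    is_dec = month == 12
    december = pd.DataFrame({
        "comp": rows.loc[is_dec, "comp"],
        "src_year": year[is_dec],
        "new_col": rows.loc[is_dec, "old_col"].astype("float64"),
    })

    merged = rows.merge(december, on=["comp", "src_year"], how="left")
    return merged.drop(columns="src_year").set_index("date").rename_axis(None)


# Small single-company frame shaped like company a in the question.
dates = pd.date_range("2000-01-01", "2002-12-01", freq="MS") + pd.offsets.MonthEnd(0)
df = pd.DataFrame({"comp": "a", "old_col": range(1, 37)}, index=dates)

out = add_new_col_by_merge(df)
print(out.loc["2001-05-31":"2001-08-31"])
```

A left merge keeps the row order of the left frame, so the result can be compared row by row with the output of the helper functions.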