Pandas Panel Data - Returns rolling cumulative sum with year gaps

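I am currently working with panel data of financial information in pandas, and I am trying to generate a rolling 3-year cumulative abnormal returns column. Unfortunately, my data is somewhat ragged, so the same company may have gaps of several years. This means I cannot simply apply .rolling(3).sum(), because it risks adding up years that do not belong together. To give you an idea, here is an example of my df: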

       datadate    fyear       tic   ab_ret   
0    31/12/1998     1998      AAPL    0.045  
1    31/12/1999     1999      AAPL    0.012   
2    31/12/2002     2002      AAPL   -0.031   
3    31/12/2003     2003      AAPL   -0.007   
4    31/12/2004     2004      AAPL    0.056
5    31/12/2005     2005      AAPL    0.001   
6    31/05/2008     2008      TSLA    0.034    
7    31/05/2009     2009      TSLA    0.061    
8    31/05/2010     2010      TSLA    0.003    
9    31/05/2011     2011      TSLA   -0.004    
10   31/05/2014     2014      TSLA    0.009  
..      ...         ..         ..      ..      
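
To make the risk concrete, here is a minimal sketch on the AAPL rows of the sample above. The DataFrame construction and the `naive` column name are only for illustration. A plain grouped rolling sum counts rows, not years, so the 2002 value combines 1998, 1999 and 2002:

```python
import pandas as pd

# AAPL rows copied from the sample table above
df = pd.DataFrame({
    "fyear": [1998, 1999, 2002, 2003, 2004, 2005],
    "tic": ["AAPL"] * 6,
    "ab_ret": [0.045, 0.012, -0.031, -0.007, 0.056, 0.001],
})

# rolling(3) is a window over the last 3 *rows* of each group,
# whether or not their fiscal years are consecutive
df["naive"] = (
    df.groupby("tic")["ab_ret"]
      .rolling(3).sum()
      .reset_index(level=0, drop=True)  # drop the 'tic' level to realign with df
)

print(df)
```

The row for 2002 gets a value (0.045 + 0.012 - 0.031) even though 2000 and 2001 are missing, which is exactly what should be avoided.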

The output I would like to obtain is:

       datadate    fyear       tic    ab_ret   cum_ab
0    31/12/1998     1998      AAPL    0.045      NaN
1    31/12/1999     1999      AAPL    0.012      NaN
2    31/12/2002     2002      AAPL   -0.031      NaN
3    31/12/2003     2003      AAPL   -0.007      NaN
4    31/12/2004     2004      AAPL    0.056    0.018
5    31/12/2005     2005      AAPL    0.001    0.050
6    31/05/2008     2008      TSLA    0.034      NaN    
7    31/05/2009     2009      TSLA    0.061      NaN
8    31/05/2010     2010      TSLA    0.003    0.098
9    31/05/2011     2011      TSLA   -0.004    0.060
10   31/05/2014     2014      TSLA    0.009      NaN
..      ...         ..         ..      ..       ..

I tried the following:

df['cum_ab'] = np.nan
mask = df.groupby('tic')['fyear'].apply(lambda x: x.shift(1)==x-1)
df.loc[mask,'cum_ab'] = df.groupby('tic')['ab_ret'].rolling(3).sum()

Unfortunately, this does not seem to work; I get the following error: ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'


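from io import StringIO

import more_itertools as mit
import pandas as pd

# sample data as CSV text (the data rows are elided in the source)
s = """datadate,fyear,tic,ab_ret
"""

df = pd.read_csv(StringIO(s))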

# create a groupby object
g = df.groupby('tic')['fyear']
# list comprehension to find consecutive groups of years per ticker
data = [{k: [list(gr) for gr in mit.consecutive_groups(v.values)]} for k,v in g]
# keep only the groups with at least 3 consecutive years
m = [{k: list(filter(lambda x: len(x)>=3, v)) for k,v in x.items()} for x in data]
# iterate through the list to merge the per-ticker dicts into one dict
d = {}
for di in m:
    d.update(di)
# create a dataframe from dict
df2 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in d.items()])).stack().reset_index(level=1).explode(0)
# create a mask and cumsum
mask = ~(df2[0].diff().bfill() == 1)
df2['gr'] = mask.cumsum().where(~mask).bfill().astype(int)
# merge two dataframes together
merge = df.merge(df2, left_on=['tic', 'fyear'], right_on=['level_1', 0])
# rolling
merge['cum_ab'] = merge.groupby(['tic', 'gr'])['ab_ret'].rolling(3).sum().reset_index(level=[0,1], drop=True)
# merge with the original df
final = df.merge(merge[['tic', 'fyear', 'cum_ab']], on=['tic', 'fyear'], how='left')

      datadate fyear   tic  ab_ret  cum_ab
0   31/12/1998  1998  AAPL     0.0     nan
1   31/12/1999  1999  AAPL     0.0     nan
2   31/12/1999  2000  AAPL     0.0     0.1
3   31/12/2002  2002  AAPL    -0.0     nan
4   31/12/2003  2003  AAPL    -0.0     nan
5   31/12/2005  2005  AAPL     0.0     nan
6   31/12/2005  2007  AAPL     0.0     nan
7   31/12/2005  2008  AAPL     0.0     nan
8   31/12/2005  2009  AAPL     0.0     0.0
9   31/05/2008  2008  TSLA     0.0     nan
10  31/05/2009  2009  TSLA     0.1     nan
11  31/05/2010  2010  TSLA     0.0     0.1
12  31/05/2011  2011  TSLA    -0.0     0.1
13  31/05/2014  2014  TSLA     0.0     nan
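
For comparison, the same idea (label runs of consecutive years, then roll within each run) can be sketched in pandas alone. This is not the code above but a simplified variant. The `run` column name is my own, and the data is the sample from the question rather than the extended data used in the answer:

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "fyear": [1998, 1999, 2002, 2003, 2004, 2005,
              2008, 2009, 2010, 2011, 2014],
    "tic": ["AAPL"] * 6 + ["TSLA"] * 5,
    "ab_ret": [0.045, 0.012, -0.031, -0.007, 0.056, 0.001,
               0.034, 0.061, 0.003, -0.004, 0.009],
})

df = df.sort_values(["tic", "fyear"]).reset_index(drop=True)

# A new run starts whenever the year is not exactly previous year + 1.
# The first row of each ticker has a NaN diff, so it also starts a run.
new_run = df.groupby("tic")["fyear"].diff().ne(1)
df["run"] = new_run.cumsum()

# Rolling 3-row sum inside each (tic, run) block, so every complete
# window covers 3 consecutive fiscal years
df["cum_ab"] = (
    df.groupby(["tic", "run"])["ab_ret"]
      .rolling(3).sum()
      .reset_index(level=[0, 1], drop=True)  # back to the original row index
)
df = df.drop(columns="run")

print(df)
```

On these rows the result matches the cum_ab column requested in the question: values appear only for AAPL 2004 and 2005 and for TSLA 2010 and 2011, and every other row is NaN. This assumes at most one row per ticker and fiscal year.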