Pandas Panel Data - Returns rolling cumulative sum with year gaps

Question

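I am working with panel data of financial information in pandas, and I am trying to generate a column of 3-year rolling cumulative abnormal returns. Unfortunately, my data is somewhat ragged, so the same company can have gaps of several years. This means I cannot simply apply .rolling(3).sum(), because that risks summing years that do not belong together. To give you an idea, here is an example of my df: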

       datadate    fyear       tic   ab_ret   
0    31/12/1998     1998      AAPL    0.045  
1    31/12/1999     1999      AAPL    0.012   
2    31/12/2002     2002      AAPL   -0.031   
3    31/12/2003     2003      AAPL   -0.007   
4    31/12/2004     2004      AAPL    0.056
5    31/12/2005     2005      AAPL    0.001   
6    31/05/2008     2008      TSLA    0.034    
7    31/05/2009     2009      TSLA    0.061    
8    31/05/2010     2010      TSLA    0.003    
9    31/05/2011     2011      TSLA   -0.004    
10   31/05/2014     2014      TSLA    0.009  
..      ...         ..         ..      ..

This is the result I want:

       datadate    fyear       tic    ab_ret   cum_ab
0    31/12/1998     1998      AAPL    0.045      NaN
1    31/12/1999     1999      AAPL    0.012      NaN
2    31/12/2002     2002      AAPL   -0.031      NaN
3    31/12/2003     2003      AAPL   -0.007      NaN
4    31/12/2004     2004      AAPL    0.056    0.018
5    31/12/2005     2005      AAPL    0.001    0.050
6    31/05/2008     2008      TSLA    0.034      NaN    
7    31/05/2009     2009      TSLA    0.061      NaN
8    31/05/2010     2010      TSLA    0.003    0.098
9    31/05/2011     2011      TSLA   -0.004    0.060
10   31/05/2014     2014      TSLA    0.009      NaN
..      ...         ..         ..      ..       ..

I tried the following code:

df['cum_ab'] = np.nan
mask = df.groupby('tic')['fyear'].apply(lambda x: x.shift(1)==x-1)
df.loc[mask,'cum_ab'] = df.groupby('tic')['ab_ret'].rolling(3).sum()

Unfortunately it does not seem to work, as I get the following error: ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'.

Thanks in advance for your help :)

Answer 1

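Here is one way to do it with more_itertools.consecutive_groups (the sample data below differs slightly from the question's):

from io import StringIO

import more_itertools as mit
import pandas as pd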

s = """datadate,fyear,tic,ab_ret
31/12/1998,1998,AAPL,0.045
31/12/1999,1999,AAPL,0.012
31/12/1999,2000,AAPL,0.012
31/12/2002,2002,AAPL,-0.031
31/12/2003,2003,AAPL,-0.007
31/12/2005,2005,AAPL,0.001
31/12/2005,2007,AAPL,0.001
31/12/2005,2008,AAPL,0.001
31/12/2005,2009,AAPL,0.001
31/05/2008,2008,TSLA,0.034
31/05/2009,2009,TSLA,0.061
31/05/2010,2010,TSLA,0.003
31/05/2011,2011,TSLA,-0.004
31/05/2014,2014,TSLA,0.009"""

df = pd.read_csv(StringIO(s))

# group the fiscal years by ticker
g = df.groupby('tic')['fyear']
# split each ticker's years into runs of consecutive years
data = [{k: [list(gr) for gr in mit.consecutive_groups(v.values)]} for k, v in g]
# keep only the runs that span at least 3 consecutive years
m = [{k: list(filter(lambda x: len(x) >= 3, v)) for k, v in x.items()} for x in data]
# collect the per-ticker dicts into a single dict
d = {}
for di in m:
    d.update(di)
# build a long dataframe with one row per (ticker, year) in a qualifying run
df2 = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in d.items()])).stack().reset_index(level=1).explode(0)
# flag the start of each new run and number the runs
mask = ~(df2[0].diff().bfill() == 1)
df2['gr'] = mask.cumsum().where(~mask).bfill().astype(int)
# attach the run id to the original rows (the inner join drops years outside qualifying runs)
merge = df.merge(df2, left_on=['tic', 'fyear'], right_on=['level_1', 0])
# 3-year rolling sum within each (ticker, run)
merge['cum_ab'] = merge.groupby(['tic', 'gr'])['ab_ret'].rolling(3).sum().reset_index(level=[0,1], drop=True)
# left-join back to the original df so every row is kept (NaN where no full window exists)
final = df.merge(merge[['tic', 'fyear', 'cum_ab']], on=['tic', 'fyear'], how='left')

Output of final (values are displayed rounded to one decimal place):

      datadate fyear   tic  ab_ret  cum_ab
0   31/12/1998  1998  AAPL     0.0     nan
1   31/12/1999  1999  AAPL     0.0     nan
2   31/12/1999  2000  AAPL     0.0     0.1
3   31/12/2002  2002  AAPL    -0.0     nan
4   31/12/2003  2003  AAPL    -0.0     nan
5   31/12/2005  2005  AAPL     0.0     nan
6   31/12/2005  2007  AAPL     0.0     nan
7   31/12/2005  2008  AAPL     0.0     nan
8   31/12/2005  2009  AAPL     0.0     0.0
9   31/05/2008  2008  TSLA     0.0     nan
10  31/05/2009  2009  TSLA     0.1     nan
11  31/05/2010  2010  TSLA     0.0     0.1
12  31/05/2011  2011  TSLA    -0.0     0.1
13  31/05/2014  2014  TSLA     0.0     nan
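
A shorter alternative, offered only as a minimal sketch and not as the method used above: compute the ordinary per-ticker rolling sum, then keep a value only when the row two positions earlier (same ticker) is exactly two fiscal years back, which means the 3-row window has no gap. The helper name rolling_sum_consecutive is hypothetical, and the sketch assumes at most one row per (tic, fyear) and a unique index. It uses the question's sample data.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "fyear": [1998, 1999, 2002, 2003, 2004, 2005,
              2008, 2009, 2010, 2011, 2014],
    "tic": ["AAPL"] * 6 + ["TSLA"] * 5,
    "ab_ret": [0.045, 0.012, -0.031, -0.007, 0.056, 0.001,
               0.034, 0.061, 0.003, -0.004, 0.009],
})


def rolling_sum_consecutive(frame, window=3):
    """Hypothetical helper: rolling sum of ab_ret over `window` gap-free fiscal years.

    Assumes one row per (tic, fyear) and a unique index.
    """
    frame = frame.sort_values(["tic", "fyear"])
    by_tic = frame.groupby("tic")
    # Plain rolling sum over the last `window` rows of each ticker. The result
    # carries a (tic, original index) MultiIndex, so drop the tic level to make
    # it align with the frame's own index.
    rolled = by_tic["ab_ret"].rolling(window).sum().reset_index(level=0, drop=True)
    # The window is gap-free only if the row `window - 1` positions earlier
    # (within the same ticker) is exactly `window - 1` fiscal years back.
    span = frame["fyear"] - by_tic["fyear"].shift(window - 1)
    # Keep the sum where the window is gap-free, NaN everywhere else.
    return rolled.where(span == window - 1)


df["cum_ab"] = rolling_sum_consecutive(df)
```

On the question's sample this fills cum_ab only for rows 4, 5, 8 and 9 (0.018, 0.050, 0.098 and 0.060, up to floating-point rounding) and leaves the other rows as NaN, which matches the desired result. Dropping the tic level with reset_index(level=0, drop=True) is what lets the grouped rolling result be assigned back to df by index.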

Tags: python, finance, numpy, pandas, panel-data