Pandas Panel Data - Rolling cumulative sum of returns with year gaps
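I'm currently working with panel data of financial information in pandas, and I'm trying to generate a column of 3-year rolling cumulative abnormal returns. Unfortunately my data is a bit patchy, so for the same company I may have gaps of several years. This means I cannot simply apply .rolling(3).sum(), because we would risk adding together years that do not belong together. Just to give you an idea, here is an example of my df: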
datadate fyear tic ab_ret
0 31/12/1998 1998 AAPL 0.045
1 31/12/1999 1999 AAPL 0.012
2 31/12/2002 2002 AAPL -0.031
3 31/12/2003 2003 AAPL -0.007
4 31/12/2004 2004 AAPL 0.056
5 31/12/2005 2005 AAPL 0.001
6 31/05/2008 2008 TSLA 0.034
7 31/05/2009 2009 TSLA 0.061
8 31/05/2010 2010 TSLA 0.003
9 31/05/2011 2011 TSLA -0.004
10 31/05/2014 2014 TSLA 0.009
.. ... .. .. ..
This is the result I want:
datadate fyear tic ab_ret cum_ab
0 31/12/1998 1998 AAPL 0.045 NaN
1 31/12/1999 1999 AAPL 0.012 NaN
2 31/12/2002 2002 AAPL -0.031 NaN
3 31/12/2003 2003 AAPL -0.007 NaN
4 31/12/2004 2004 AAPL 0.056 0.018
5 31/12/2005 2005 AAPL 0.001 0.050
6 31/05/2008 2008 TSLA 0.034 NaN
7 31/05/2009 2009 TSLA 0.061 NaN
8 31/05/2010 2010 TSLA 0.003 0.098
9 31/05/2011 2011 TSLA -0.004 0.060
10 31/05/2014 2014 TSLA 0.009 NaN
.. ... .. .. .. ..
I tried the following code:
df['cum_ab'] = np.nan
mask = df.groupby('tic')['fyear'].apply(lambda x: x.shift(1)==x-1)
df.loc[mask,'cum_ab'] = df.groupby('tic')['ab_ret'].rolling(3).sum()
But unfortunately it doesn't seem to work, as I get the following error: ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'.
Thanks in advance for your help :)
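The masking idea in the attempt above can probably be made to work by computing the rolling sum first and then blanking out windows that span a gap. The following is a minimal sketch, not a definitive implementation. It assumes the rows are sorted by `tic` and `fyear` and that each ticker has at most one row per fiscal year; the names `roll` and `complete` are illustrative and do not come from the question.

```python
import pandas as pd

# Sample rows copied from the question (datadate omitted for brevity).
df = pd.DataFrame({
    "fyear": [1998, 1999, 2002, 2003, 2004, 2005, 2008, 2009, 2010, 2011, 2014],
    "tic": ["AAPL"] * 6 + ["TSLA"] * 5,
    "ab_ret": [0.045, 0.012, -0.031, -0.007, 0.056, 0.001,
               0.034, 0.061, 0.003, -0.004, 0.009],
})

# Assumption: sorted by ticker and year, one row per (tic, fyear).
df = df.sort_values(["tic", "fyear"]).reset_index(drop=True)
g = df.groupby("tic")

# Plain 3-row rolling sum per ticker; at this point a window may still span a gap.
roll = g["ab_ret"].transform(lambda s: s.rolling(3).sum())

# A window is gap-free only if the row two positions back (same ticker)
# is exactly two fiscal years earlier.
complete = g["fyear"].shift(2).eq(df["fyear"] - 2)

# Keep the rolling sum where the window is complete, NaN elsewhere.
df["cum_ab"] = roll.where(complete)
print(df)
```

Because the years are strictly increasing within a ticker under the stated assumption, checking the row two positions back is enough: if it is exactly two years earlier, the row in between must be the year in the middle.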
from io import StringIO

import more_itertools as mit
import pandas as pd

s = """datadate,fyear,tic,ab_ret
31/12/1998,1998,AAPL,0.045
31/12/1999,1999,AAPL,0.012
31/12/1999,2000,AAPL,0.012
31/12/2002,2002,AAPL,-0.031
31/12/2003,2003,AAPL,-0.007
31/12/2005,2005,AAPL,0.001
31/12/2005,2007,AAPL,0.001
31/12/2005,2008,AAPL,0.001
31/12/2005,2009,AAPL,0.001
31/05/2008,2008,TSLA,0.034
31/05/2009,2009,TSLA,0.061
31/05/2010,2010,TSLA,0.003
31/05/2011,2011,TSLA,-0.004
31/05/2014,2014,TSLA,0.009"""
df = pd.read_csv(StringIO(s))
# create a groupby object
g = df.groupby('tic')['fyear']
# for each ticker, split its fiscal years into runs of consecutive years
data = [{k: [list(gr) for gr in mit.consecutive_groups(v.values)]} for k,v in g]
# keep only the runs that contain at least 3 consecutive years
m = [{k: list(filter(lambda x: len(x)>=3, v)) for k,v in x.items()} for x in data]
# combine the per-ticker dicts into a single dict
d = {}
for di in m:
    d.update(di)
# create a dataframe from dict
df2 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in d.items()])).stack().reset_index(level=1).explode(0)
# mark where a new run of years starts, then give each run an integer id
mask = ~(df2[0].diff().bfill() == 1)
df2['gr'] = mask.cumsum().where(~mask).bfill().astype(int)
# merge two dataframes together
merge = df.merge(df2, left_on=['tic', 'fyear'], right_on=['level_1', 0])
# 3-year rolling sum within each ticker and run
merge['cum_ab'] = merge.groupby(['tic', 'gr'])['ab_ret'].rolling(3).sum().reset_index(level=[0,1], drop=True)
# merge with the original df
final = df.merge(merge[['tic', 'fyear', 'cum_ab']], on=['tic', 'fyear'], how='left')
Resulting final (values are displayed rounded to one decimal place):
datadate fyear tic ab_ret cum_ab
0 31/12/1998 1998 AAPL 0.0 nan
1 31/12/1999 1999 AAPL 0.0 nan
2 31/12/1999 2000 AAPL 0.0 0.1
3 31/12/2002 2002 AAPL -0.0 nan
4 31/12/2003 2003 AAPL -0.0 nan
5 31/12/2005 2005 AAPL 0.0 nan
6 31/12/2005 2007 AAPL 0.0 nan
7 31/12/2005 2008 AAPL 0.0 nan
8 31/12/2005 2009 AAPL 0.0 0.0
9 31/05/2008 2008 TSLA 0.0 nan
10 31/05/2009 2009 TSLA 0.1 nan
11 31/05/2010 2010 TSLA 0.0 0.1
12 31/05/2011 2011 TSLA -0.0 0.1
13 31/05/2014 2014 TSLA 0.0 nan
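The same grouping idea (a rolling sum inside each run of consecutive years) can also be sketched with pandas alone, without more_itertools or the intermediate merge. This is a hedged alternative rather than the method used above. It assumes the frame is sorted by `tic` and `fyear` with one row per ticker and year; the column name `block` and the variable `new_block` are made up for illustration. The data is the sample from the question.

```python
import pandas as pd

# Sample rows copied from the question (datadate omitted for brevity).
df = pd.DataFrame({
    "fyear": [1998, 1999, 2002, 2003, 2004, 2005, 2008, 2009, 2010, 2011, 2014],
    "tic": ["AAPL"] * 6 + ["TSLA"] * 5,
    "ab_ret": [0.045, 0.012, -0.031, -0.007, 0.056, 0.001,
               0.034, 0.061, 0.003, -0.004, 0.009],
})

# Assumption: sorted by ticker and year, one row per (tic, fyear).
df = df.sort_values(["tic", "fyear"]).reset_index(drop=True)

# A new block starts whenever the year is not exactly one more than the
# previous year of the same ticker (the first row of a ticker counts too,
# because its diff is NaN).
new_block = df.groupby("tic")["fyear"].diff().ne(1)
df["block"] = new_block.cumsum()

# Rolling 3-year sum restricted to each block of consecutive years.
df["cum_ab"] = (
    df.groupby(["tic", "block"])["ab_ret"]
      .transform(lambda s: s.rolling(3).sum())
)
print(df.drop(columns="block"))
```

Blocks shorter than three years never fill a window, so they come out as NaN without any explicit length filter.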