Merge and compute moving averages based on the full past and on a window
I have two dataframes, as shown below:
df = pd.DataFrame(
{'stud_name' : ['ABC', 'ABC','ABC','ABC',
'DEF'],
'ques_date' : ['13/11/2020', '10/1/2018','11/11/2017', '27/03/2016',
'13/05/2010']})
df_score = pd.DataFrame(
{'stud_name' : ['ABC', 'ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF'],
'qtr':['Q1','Q2','Q3','Q4','Q1','Q2','Q3','Q3','Q4','Q2','Q4'],
'year' : [2015,2015,2015,2015,2016,2017,2017,2017,2017,2018,2017],
't_score':[11,13,15,17,12,312,14,15,18,43,32],
'p_score':[32,45,32,21,56,87,32,786,213,32,11]})
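I would like to do the following:
a) For each stud_name, compute the moving average of t_score as two output columns:
mov_avg_full = uses all past data for the stud_name (information from all past quarters in df_score)
mov_avg_2qtr = uses only the past 2 quarters of data (information from only the past 2 quarters in df_score)
For example: if the year is 2020 and the quarter is Q3, I would like to compute the moving average over all past data (everything before 2020 Q3) and the moving average over the last two quarters (2020 Q1 and 2020 Q2).
If a particular stud_name has no past data, we just put NA (for example, DEF has no data in df_score before its ques_date).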
I tried the following:
df['ques_date'] = pd.to_datetime(df['ques_date'], dayfirst=True)
df.sort_values(by=['stud_name','ques_date'],inplace=True)
df['act_qtr'] = df['ques_date'].dt.to_period('Q').dt.strftime('Q%q')
df['year'] = df['ques_date'].dt.year
df_score.sort_values(by=['year','qtr'],inplace=True)
df_full = df.merge(df_score,on=['stud_name'])
df_full['mov_avg_2qtr'] = df_full['t_score'].rolling(2).mean() # this is incorrect
I expect my output to look like the one shown below.
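You may want to use the rolling and expanding methods. After taking the Cartesian product of the question quarters and the score quarters for each student, you can apply a date mask that keeps only the score rows from earlier quarters, and then take the most recent of those rows.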
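To make the difference between the two methods concrete, here is a minimal sketch on a toy series. The values and the names `s`, `full`, and `window` are illustrative assumptions, not data from the question. `expanding().mean()` averages every row seen so far, while `rolling(2).mean()` averages only the current row and the one before it, so it is NaN until two rows are available.

```python
import pandas as pd

# Toy scores in chronological order (illustrative values only)
s = pd.Series([10, 20, 60], name='t_score')

# Mean of all rows up to and including the current one
full = s.expanding().mean()

# Mean of the current row and the previous row; NaN for the first row
window = s.rolling(2).mean()

print(pd.DataFrame({'t_score': s, 'mov_avg_full': full, 'mov_avg_2qtr': window}))
```

Note that `rolling(2)` counts rows, not calendar quarters, so a gap in the data means the window spans the last two available records rather than the last two calendar quarters.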
Code:
import pandas as pd
# Create sample dataframes
df = pd.DataFrame({'stud_name': ['ABC', 'ABC','ABC','ABC', 'DEF'], 'ques_date' : ['13/11/2020', '10/1/2018','11/11/2017', '27/03/2016', '13/05/2010']})
df_score = pd.DataFrame({'stud_name': ['ABC', 'ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF'], 'qtr':['Q1','Q2','Q3','Q4','Q1','Q2','Q3','Q3','Q4','Q2','Q4'], 'year' : [2015,2015,2015,2015,2016,2017,2017,2017,2017,2018,2017], 't_score':[11,13,15,17,12,312,14,15,18,43,32], 'p_score':[32,45,32,21,56,87,32,786,213,32,11]})
# Parse the dates and derive the quarter label, year, and quarterly Period of each question
df['ques_date'] = pd.to_datetime(df['ques_date'], format='%d/%m/%Y')
df['act_qtr'] = 'Q' + df['ques_date'].dt.quarter.astype(str)
df['act_year'] = df['ques_date'].dt.year
df['act_key'] = df['ques_date'].dt.to_period('Q')
# Build a quarterly Period key (e.g. 2015Q1) so it can be compared with act_key
df_score['key'] = pd.PeriodIndex(df_score['year'].astype(str) + df_score['qtr'], freq='Q')
# Calculate the two kinds of moving average per student, in chronological order
df_score.sort_values(['year', 'qtr'], inplace=True)
grouped = df_score.groupby('stud_name')['t_score']
# Drop the group level so the results align with df_score by index, not by position
df_score['mov_avg_full'] = grouped.expanding().mean().reset_index(level=0, drop=True)
df_score['mov_avg_2qtr'] = grouped.rolling(2).mean().reset_index(level=0, drop=True)
# Pair every question with every score row of the same student
df_full = df.merge(df_score, on='stud_name').sort_values(['act_key', 'key'])
# Keep only score rows from earlier quarters, then take the latest one per question quarter
df_full = df_full[df_full['key'] < df_full['act_key']].groupby(['stud_name', 'act_qtr', 'act_year'], as_index=False).last()
# Left-merge back so questions without past data get NaN, and keep the necessary columns
df_full = df.merge(df_full, how='left', on=['stud_name', 'ques_date', 'act_qtr', 'act_year'])
df_full = df_full[['stud_name', 'ques_date', 'act_qtr', 'act_year', 'mov_avg_full', 'mov_avg_2qtr']]
print(df_full)
Output:
stud_name | ques_date | act_qtr | act_year | mov_avg_full | mov_avg_2qtr |
---|---|---|---|---|---|
ABC | 2020-11-13 00:00:00 | Q4 | 2020 | 56.2857 | 163 |
ABC | 2018-01-10 00:00:00 | Q1 | 2018 | 56.2857 | 163 |
ABC | 2017-11-11 00:00:00 | Q4 | 2017 | 56.2857 | 163 |
ABC | 2016-03-27 00:00:00 | Q1 | 2016 | 14 | 16 |
DEF | 2010-05-13 00:00:00 | Q2 | 2010 | nan | nan |
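The Cartesian product grows with the number of questions times the number of score rows per student. As a possible alternative that is not part of the approach above, the same "latest earlier quarter" lookup could be sketched with `pd.merge_asof`. The column names `act_start` and `qtr_start` and the variable `result` are assumptions introduced for this sketch, which uses the same sample data.

```python
import pandas as pd

df = pd.DataFrame({'stud_name': ['ABC', 'ABC', 'ABC', 'ABC', 'DEF'],
                   'ques_date': ['13/11/2020', '10/1/2018', '11/11/2017', '27/03/2016', '13/05/2010']})
df_score = pd.DataFrame({'stud_name': ['ABC'] * 7 + ['DEF'] * 4,
                         'qtr': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q3', 'Q4', 'Q2', 'Q4'],
                         'year': [2015, 2015, 2015, 2015, 2016, 2017, 2017, 2017, 2017, 2018, 2017],
                         't_score': [11, 13, 15, 17, 12, 312, 14, 15, 18, 43, 32]})

# First day of the quarter in which each question was asked
df['ques_date'] = pd.to_datetime(df['ques_date'], format='%d/%m/%Y')
df['act_start'] = df['ques_date'].dt.to_period('Q').dt.start_time

# First day of the quarter each score row belongs to
quarters = pd.PeriodIndex(df_score['year'].astype(str) + df_score['qtr'], freq='Q')
df_score['qtr_start'] = quarters.start_time

# Chronological order; a stable sort keeps the original order of tied quarters
df_score = df_score.sort_values('qtr_start', kind='stable')
grouped = df_score.groupby('stud_name')['t_score']
df_score['mov_avg_full'] = grouped.expanding().mean().reset_index(level=0, drop=True)
df_score['mov_avg_2qtr'] = grouped.rolling(2).mean().reset_index(level=0, drop=True)

# For each question, pick the latest score row of the same student
# from a strictly earlier quarter (both frames must be sorted by their key)
result = pd.merge_asof(
    df.sort_values('act_start'),
    df_score[['stud_name', 'qtr_start', 'mov_avg_full', 'mov_avg_2qtr']],
    left_on='act_start', right_on='qtr_start', by='stud_name',
    allow_exact_matches=False,
)
print(result[['stud_name', 'ques_date', 'mov_avg_full', 'mov_avg_2qtr']])
```

Questions with no earlier score row, such as DEF here, come back with NaN in both average columns, so no separate left merge is needed. The rows are returned sorted by `act_start` rather than in the original order of `df`.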