Pandas:使用 Groupby 的滚动时间加权移动平均线
Pandas: Rolling time-weighted moving average with Groupby
我有以下客户销售历史数据框(这只是其中的一部分,实际数据框超过 70k 行):
import pandas as pd
import datetime as DT
df_test = pd.DataFrame({
'Cus_ID': ["T313","T348","T313","T348","T313","T348","T329","T329","T348","T313","T329","T348"],
'Value': [3,2,3,4,5,3,7.25,10.25,4.5,11.75,6.25,6],
'Date' : [
DT.datetime(2015,10,18),
DT.datetime(2015,11,14),
DT.datetime(2015,11,18),
DT.datetime(2015,12,13),
DT.datetime(2015,12,19),
DT.datetime(2016,1,24),
DT.datetime(2016,1,31),
DT.datetime(2016,2,17),
DT.datetime(2016,3,28),
DT.datetime(2016,3,31),
DT.datetime(2016,4,3),
DT.datetime(2016,4,16),
]})
我想向数据框添加一个新列以显示该客户过去 90 天的时间加权平均值结果。
预期结果(第 Value_Result
列):
Cus_ID Date Value Value_Result
0 T313 2015-10-18 3.00 NaN (No 90days history)
1 T348 2015-11-14 2.00 NaN (No 90days history)
2 T313 2015-11-18 3.00 3 (3*31)/31
3 T348 2015-12-13 4.00 2 (2*29)/29
4 T313 2015-12-19 5.00 3 (3*62+3*31)/(62+31)
5 T348 2016-01-24 3.00 2.743 (4*42+2*71)/(42+71)
6 T329 2016-01-31 7.25 NaN (No 90days history)
7 T329 2016-02-17 10.25 7.25 (7.25*17)/17
8 T348 2016-03-28 4.50 3 (3*64)/64
9 T313 2016-03-31 11.75 NaN (No 90days history)
10 T329 2016-04-03 6.25 8.516 (10.25*46+7.25*63)/(46+63)
11 T348 2016-04-16 6.00 3.279 (4.5*19+3*83)/(19+83)
我试过使用 groupby('Cus_ID')
和滚动应用,但我很难编写仅考虑向后 90 天的函数。
非常感谢任何意见。
我不确定滚动函数是否适合加权平均值,尽管也许其他人知道如何使用它
我不能保证这将是最优化的方法,但它会产生您想要的结果,您可以采用它并在必要时在此基础上进行构建。
非常感谢这个pbpython article。我建议通读它。
我的方法是创建一个将应用于组的函数(按 Cus_ID 分组)。此函数将遍历该组中的行并按照您上面的描述进行加权平均,将其应用回组和 return 组。为了解释清楚起见,此代码片段很冗长,如果需要,您可以通过删除所有创建的变量来 trim 简化它。
应用函数如下所示
def tw_avg(group, value_col, time_col, new_col_name="time_weighted_average", days_back='-90 days', fill_value=np.nan):
"""
Will calculate the weighted (by day) time average of the group passed.
It will not operate on the day it is evaulating but the previous days_back.
Should be used with the apply() function in Pandas with groupby function
Args:
group (pandas.DataFrame): Will be passed by pandas
value_col (str): Name of column with value to be averaged by weight
time_col (str): Name of column of with times in them
new_col_name (str): Name of new column to place time weighted average into, default: time_weighted_average
days_back (str): Time delta description as described in panda time deltas documentation, default: -90 days
fill_value (any): The value to fill rows which do not have data in days_back period, default: np.nan
Returns:
(pandas.DataFrame): The modified DataFrame with time weighted average added to columns, np.nan if no
time weight average exist
"""
for idx, row in group.iterrows():
# Filter for only values that are days_back for averaging.
days_back_fil = (group[time_col] < row[time_col]) & (group[time_col] >= row[time_col] + pd.Timedelta(days_back))
df = group[days_back_fil]
df['days-back'] = (row[time_col] - df[time_col]) / np.timedelta64(1, 'D') # need to divide by np.timedelta day to get number back
df['weight'] = df[value_col] * df['days-back']
try:
df['tw_avg'] = df['weight'].sum() / df['days-back'].sum()
time_avg = df['tw_avg'].iloc[0] # Get single value of the tw_avg
group.loc[idx, new_col_name] = time_avg
except ZeroDivisionError:
group.loc[idx, new_col_name] = fill_value
return group
然后您可以 return 使用此行查找您要查找的 DataFrame
df_test.groupby(by=['Cus_ID']).apply(tw_avg, 'Value', 'Date')
这将产生,
Cus_ID Date Value time_weighted_average
0 T313 2015-10-18 3.0 NaN
1 T348 2015-11-14 2.0 NaN
2 T313 2015-11-18 3.0 3.0
3 T348 2015-12-13 4.0 2.0
4 T313 2015-12-19 5.0 3.0
5 T348 2016-01-24 3.0 2.743362831858407
6 T329 2016-01-31 7.25 NaN
7 T329 2016-02-17 10.25 7.25
8 T348 2016-03-28 4.5 3.0
9 T313 2016-03-31 11.75 NaN
10 T329 2016-04-03 6.25 8.51605504587156
11 T348 2016-04-16 6.0 3.2794117647058822
您现在可以使用该函数通过 value_col
参数将加权平均值应用于其他值列,或者使用 days_back
参数更改时间 window 长度。有关如何描述时间增量,请参阅 pandas time deltas 页面。
我有以下客户销售历史数据框(这只是其中的一部分,实际数据框超过 70k 行):
import pandas as pd
import datetime as DT
df_test = pd.DataFrame({
'Cus_ID': ["T313","T348","T313","T348","T313","T348","T329","T329","T348","T313","T329","T348"],
'Value': [3,2,3,4,5,3,7.25,10.25,4.5,11.75,6.25,6],
'Date' : [
DT.datetime(2015,10,18),
DT.datetime(2015,11,14),
DT.datetime(2015,11,18),
DT.datetime(2015,12,13),
DT.datetime(2015,12,19),
DT.datetime(2016,1,24),
DT.datetime(2016,1,31),
DT.datetime(2016,2,17),
DT.datetime(2016,3,28),
DT.datetime(2016,3,31),
DT.datetime(2016,4,3),
DT.datetime(2016,4,16),
]})
我想向数据框添加一个新列以显示该客户过去 90 天的时间加权平均值结果。
预期结果(第 Value_Result
列):
Cus_ID Date Value Value_Result
0 T313 2015-10-18 3.00 NaN (No 90days history)
1 T348 2015-11-14 2.00 NaN (No 90days history)
2 T313 2015-11-18 3.00 3 (3*31)/31
3 T348 2015-12-13 4.00 2 (2*29)/29
4 T313 2015-12-19 5.00 3 (3*62+3*31)/(62+31)
5 T348 2016-01-24 3.00 2.743 (4*42+2*71)/(42+71)
6 T329 2016-01-31 7.25 NaN (No 90days history)
7 T329 2016-02-17 10.25 7.25 (7.25*17)/17
8 T348 2016-03-28 4.50 3 (3*64)/64
9 T313 2016-03-31 11.75 NaN (No 90days history)
10 T329 2016-04-03 6.25 8.516 (10.25*46+7.25*63)/(46+63)
11 T348 2016-04-16 6.00 3.279 (4.5*19+3*83)/(19+83)
我试过使用 groupby('Cus_ID')
和滚动应用,但我很难编写仅考虑向后 90 天的函数。
非常感谢任何意见。
我不确定滚动函数是否适合加权平均值,尽管也许其他人知道如何使用它 我不能保证这将是最优化的方法,但它会产生您想要的结果,您可以采用它并在必要时在此基础上进行构建。
非常感谢这个pbpython article。我建议通读它。
我的方法是创建一个将应用于组的函数(按 Cus_ID 分组)。此函数将遍历该组中的行并按照您上面的描述进行加权平均,将其应用回组和 return 组。为了解释清楚起见,此代码片段很冗长,如果需要,您可以通过删除所有创建的变量来 trim 简化它。
应用函数如下所示
def tw_avg(group, value_col, time_col, new_col_name="time_weighted_average", days_back='-90 days', fill_value=np.nan):
"""
Will calculate the weighted (by day) time average of the group passed.
It will not operate on the day it is evaulating but the previous days_back.
Should be used with the apply() function in Pandas with groupby function
Args:
group (pandas.DataFrame): Will be passed by pandas
value_col (str): Name of column with value to be averaged by weight
time_col (str): Name of column of with times in them
new_col_name (str): Name of new column to place time weighted average into, default: time_weighted_average
days_back (str): Time delta description as described in panda time deltas documentation, default: -90 days
fill_value (any): The value to fill rows which do not have data in days_back period, default: np.nan
Returns:
(pandas.DataFrame): The modified DataFrame with time weighted average added to columns, np.nan if no
time weight average exist
"""
for idx, row in group.iterrows():
# Filter for only values that are days_back for averaging.
days_back_fil = (group[time_col] < row[time_col]) & (group[time_col] >= row[time_col] + pd.Timedelta(days_back))
df = group[days_back_fil]
df['days-back'] = (row[time_col] - df[time_col]) / np.timedelta64(1, 'D') # need to divide by np.timedelta day to get number back
df['weight'] = df[value_col] * df['days-back']
try:
df['tw_avg'] = df['weight'].sum() / df['days-back'].sum()
time_avg = df['tw_avg'].iloc[0] # Get single value of the tw_avg
group.loc[idx, new_col_name] = time_avg
except ZeroDivisionError:
group.loc[idx, new_col_name] = fill_value
return group
然后您可以 return 使用此行查找您要查找的 DataFrame
df_test.groupby(by=['Cus_ID']).apply(tw_avg, 'Value', 'Date')
这将产生,
Cus_ID Date Value time_weighted_average
0 T313 2015-10-18 3.0 NaN
1 T348 2015-11-14 2.0 NaN
2 T313 2015-11-18 3.0 3.0
3 T348 2015-12-13 4.0 2.0
4 T313 2015-12-19 5.0 3.0
5 T348 2016-01-24 3.0 2.743362831858407
6 T329 2016-01-31 7.25 NaN
7 T329 2016-02-17 10.25 7.25
8 T348 2016-03-28 4.5 3.0
9 T313 2016-03-31 11.75 NaN
10 T329 2016-04-03 6.25 8.51605504587156
11 T348 2016-04-16 6.0 3.2794117647058822
您现在可以使用该函数通过 value_col
参数将加权平均值应用于其他值列,或者使用 days_back
参数更改时间 window 长度。有关如何描述时间增量,请参阅 pandas time deltas 页面。