Pandas: Average value for the past n days
I have a pandas DataFrame like this:
import pandas as pd

test = pd.DataFrame({'Date': ['2016-04-01', '2016-04-01', '2016-04-02',
                              '2016-04-02', '2016-04-03', '2016-04-04',
                              '2016-04-05', '2016-04-06', '2016-04-06'],
                     'User': ['Mike', 'John', 'Mike', 'John', 'Mike', 'Mike',
                              'Mike', 'Mike', 'John'],
                     'Value': [1, 2, 1, 3, 4.5, 1, 2, 3, 6]})
As you can see below, the data set does not necessarily have an observation for every day:
Date User Value
0 2016-04-01 Mike 1.0
1 2016-04-01 John 2.0
2 2016-04-02 Mike 1.0
3 2016-04-02 John 3.0
4 2016-04-03 Mike 4.5
5 2016-04-04 Mike 1.0
6 2016-04-05 Mike 2.0
7 2016-04-06 Mike 3.0
8 2016-04-06 John 6.0
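I would like to add a new column that shows, for each user, the average value over the past n days (n = 2 in this example), provided at least one of those days has data; otherwise the value should be NaN. For example, on 2016-04-06 John gets NaN because he has no data for 2016-04-05 or 2016-04-04. So the result would look like this: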
Date User Value Value_Average_Past_2_days
0 2016-04-01 Mike 1.0 NaN
1 2016-04-01 John 2.0 NaN
2 2016-04-02 Mike 1.0 1.00
3 2016-04-02 John 3.0 2.00
4 2016-04-03 Mike 4.5 1.00
5 2016-04-04 Mike 1.0 2.75
6 2016-04-05 Mike 2.0 2.75
7 2016-04-06 Mike 3.0 1.50
8 2016-04-06 John 6.0 NaN
After reading several posts on the forum, it seems I should combine group_by with a custom rolling_mean, but I have not been able to work out how.
n = 2
# Cast your dates as timestamps.
test['Date'] = pd.to_datetime(test.Date)
# Create a daily index spanning the range of the original index.
# It is named 'Date' so that reset_index() below yields a 'Date' column.
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D', name='Date')
# Pivot by Dates and Users.
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
>>> df.head(3)
User John Mike
2016-04-01 2 1.0
2016-04-02 3 1.0
2016-04-03 NaN 4.5
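# Apply a rolling mean on the above dataframe and reset the index.
# shift() moves each value down one day, so the window covers past days only.
# Before pandas 0.18.0 (pd.rolling_mean was deprecated in 0.18.0 and removed in later releases):
df2 = (pd.rolling_mean(df.shift(), n, min_periods=1)
       .reset_index()
       .drop_duplicates())
# For pandas 0.18.0+
df2 = (df.shift().rolling(window=n, min_periods=1).mean()
       .reset_index()
       .drop_duplicates())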
# Melt the result back into the original form.
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
.sort_values(['Date', 'User'])
.reset_index(drop=True))
>>> df3.head()
Date User Value
0 2016-04-01 John NaN
1 2016-04-01 Mike NaN
2 2016-04-02 John 2.0
3 2016-04-02 Mike 1.0
4 2016-04-03 John 2.5
# Merge the results back into the original dataframe.
>>> test.merge(df3, on=['Date', 'User'], how='left',
suffixes=['', '_Average_past_{0}_days'.format(n)])
Date User Value Value_Average_past_2_days
0 2016-04-01 Mike 1.0 NaN
1 2016-04-01 John 2.0 NaN
2 2016-04-02 Mike 1.0 1.00
3 2016-04-02 John 3.0 2.00
4 2016-04-03 Mike 4.5 1.00
5 2016-04-04 Mike 1.0 2.75
6 2016-04-05 Mike 2.0 2.75
7 2016-04-06 Mike 3.0 1.50
8 2016-04-06 John 6.0 NaN
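The shift() call in the rolling step is what restricts the window to past days. A minimal sketch on Mike's values from the question shows the difference; the names mike, including_today and past_only are illustrative only and do not appear in the answer.

```python
import pandas as pd

# Mike's values from the question: one observation per calendar day.
mike = pd.Series([1, 1, 4.5, 1, 2, 3],
                 index=pd.date_range('2016-04-01', periods=6, freq='D'))

# Without shift(): each window ends on the current day, so today's value is included.
including_today = mike.rolling(window=2, min_periods=1).mean()

# With shift(): every value moves down one day, so each window holds
# only the two previous days.
past_only = mike.shift().rolling(window=2, min_periods=1).mean()

print(pd.DataFrame({'Value': mike,
                    'including_today': including_today,
                    'past_only': past_only}))
```

Only the past_only column matches the Value_Average_Past_2_days values requested for Mike.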
Summary
n = 2
test['Date'] = pd.to_datetime(test.Date)
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D', name='Date')
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
# pandas 0.18.0+ API; on older versions use pd.rolling_mean(df.shift(), n, min_periods=1).
df2 = (df.shift().rolling(window=n, min_periods=1).mean()
       .reset_index()
       .drop_duplicates())
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
       .sort_values(['Date', 'User'])
       .reset_index(drop=True))
test.merge(df3, on=['Date', 'User'], how='left',
           suffixes=['', '_Average_past_{0}_days'.format(n)])
I think you can first convert the Date column with to_datetime, then fill in the missing days using groupby with resample, and finally apply rolling:
test['Date'] = pd.to_datetime(test['Date'])
df = test.groupby('User').apply(lambda x: x.set_index('Date').resample('1D').first())
print(df)
User Value
User Date
John 2016-04-01 John 2.0
2016-04-02 John 3.0
2016-04-03 NaN NaN
2016-04-04 NaN NaN
2016-04-05 NaN NaN
2016-04-06 John 6.0
Mike 2016-04-01 Mike 1.0
2016-04-02 Mike 1.0
2016-04-03 Mike 4.5
2016-04-04 Mike 1.0
2016-04-05 Mike 2.0
2016-04-06 Mike 3.0
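# group_keys=False keeps the original (User, Date) index, so reset_index()
# also works on recent pandas versions that would otherwise prepend the group key.
df1 = (df.groupby(level=0, group_keys=False)['Value']
         .apply(lambda x: x.shift().rolling(min_periods=1, window=2).mean())
         .reset_index(name='Value_Average_Past_2_days'))
print(df1)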
User Date Value_Average_Past_2_days
0 John 2016-04-01 NaN
1 John 2016-04-02 2.00
2 John 2016-04-03 2.50
3 John 2016-04-04 3.00
4 John 2016-04-05 NaN
5 John 2016-04-06 NaN
6 Mike 2016-04-01 NaN
7 Mike 2016-04-02 1.00
8 Mike 2016-04-03 1.00
9 Mike 2016-04-04 2.75
10 Mike 2016-04-05 2.75
11 Mike 2016-04-06 1.50
print(pd.merge(test, df1, on=['Date', 'User'], how='left'))
Date User Value Value_Average_Past_2_days
0 2016-04-01 Mike 1.0 NaN
1 2016-04-01 John 2.0 NaN
2 2016-04-02 Mike 1.0 1.00
3 2016-04-02 John 3.0 2.00
4 2016-04-03 Mike 4.5 1.00
5 2016-04-04 Mike 1.0 2.75
6 2016-04-05 Mike 2.0 2.75
7 2016-04-06 Mike 3.0 1.50
8 2016-04-06 John 6.0 NaN
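The steps above hard-code a window of 2. As a rough sketch, they could be wrapped in a function that takes n as a parameter. The helper name add_past_n_day_mean and its keyword arguments are hypothetical, and the sketch assumes each user has at most one row per day. It uses asfreq('D') per user instead of groupby(...).apply(...resample...).

```python
import pandas as pd


def add_past_n_day_mean(df, n, date_col='Date', user_col='User', value_col='Value'):
    """Hypothetical helper: per-user mean over the previous n calendar days."""
    new_col = f'{value_col}_Average_Past_{n}_days'
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    pieces = []
    for user, grp in out.groupby(user_col):
        # asfreq('D') inserts NaN rows for missing days (needs unique dates per user).
        daily = (grp.set_index(date_col)[value_col]
                    .sort_index()
                    .asfreq('D')
                    .rename_axis(date_col))
        # shift() drops the current day from the window; min_periods=1 requires
        # at least one observed day among the previous n.
        past = daily.shift().rolling(window=n, min_periods=1).mean()
        pieces.append(past.rename(new_col).reset_index().assign(**{user_col: user}))
    # A left merge keeps the original rows and their order.
    return out.merge(pd.concat(pieces), on=[date_col, user_col], how='left')


test = pd.DataFrame({'Date': ['2016-04-01', '2016-04-01', '2016-04-02',
                              '2016-04-02', '2016-04-03', '2016-04-04',
                              '2016-04-05', '2016-04-06', '2016-04-06'],
                     'User': ['Mike', 'John', 'Mike', 'John', 'Mike', 'Mike',
                              'Mike', 'Mike', 'John'],
                     'Value': [1, 2, 1, 3, 4.5, 1, 2, 3, 6]})
result = add_past_n_day_mean(test, n=2)
print(result)
```

With n=2 this should give the same Value_Average_Past_2_days column as the merge above.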
Computing a 30-day / 1-month rolling average with groupby
df_px = df_px.set_index(pd.to_datetime(df_px['date']))
df_px['px_avg30d']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling('30D').mean())
Reproducible example
import datetime
import pandas as pd
import pandas_datareader as pddr
df = pddr.DataReader(['CAT','WMT'], 'yahoo', datetime.date(2021,6,30), datetime.date(2022,1,1))
df_px = df['Adj Close'].copy()
df_px = df_px.resample('W-MON').first()
df_px = df_px.sample(frac=0.33, random_state=0).sort_index()
df_px['date']=df_px.index.astype(str).str[:10]
df_px = df_px.melt(id_vars=['date'])
df_px.columns = ['date','stock','px']
df_px = df_px.set_index(pd.to_datetime(df_px['date']))
df_px['px_avg30d']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling('30D').mean())
df_px['px_avg3']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling(3,min_periods=1).mean())
df_px['px_avg4']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling(4,min_periods=1).mean())
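The same time-based window can be applied to the data from the original question. This is a sketch, not part of the answer above: it assumes a pandas version whose rolling() accepts an offset window together with closed='left', and that each user's dates are sorted with at most one row per day. closed='left' makes the window cover [t - n days, t), which leaves out the current day, so the missing days do not need to be filled in first.

```python
import pandas as pd

test = pd.DataFrame({'Date': ['2016-04-01', '2016-04-01', '2016-04-02',
                              '2016-04-02', '2016-04-03', '2016-04-04',
                              '2016-04-05', '2016-04-06', '2016-04-06'],
                     'User': ['Mike', 'John', 'Mike', 'John', 'Mike', 'Mike',
                              'Mike', 'Mike', 'John'],
                     'Value': [1, 2, 1, 3, 4.5, 1, 2, 3, 6]})
test['Date'] = pd.to_datetime(test['Date'])

n = 2
indexed = test.set_index('Date')
# Offset windows default to min_periods=1, so one observed day in the window is enough;
# an empty window yields NaN.
past_mean = (indexed.groupby('User')['Value']
                    .transform(lambda s: s.rolling(f'{n}D', closed='left').mean()))
# transform() keeps the original row order, so the values can be assigned positionally.
test[f'Value_Average_Past_{n}_days'] = past_mean.to_numpy()
print(test)
```

On this data the new column should match the expected output in the question, including NaN for John on 2016-04-06.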