Pandas - aggregate values with a variable-length rolling window
The following DataFrame is used as input:
from io import StringIO

import numpy as np
import pandas as pd

json_string = '{"datetime":{"0":1528955662000,"1":1528959255000,"2":1528965487000,"3":1528966204000,"4":1528966289000,"5":1528971637000,"6":1528974438000,"7":1528975251000,"8":1528982200000,"9":1528992569000,"10":1528994282000},"hit":{"0":1,"1":0,"2":0,"3":0,"4":0,"5":1,"6":1,"7":0,"8":1,"9":0,"10":1}}'
# Wrap the literal in StringIO: recent pandas versions deprecate passing a raw JSON string.
df = pd.read_json(StringIO(json_string))
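The exercise asks you to compute the mean of the `hit` column at each moment (`datetime`). However, the current observation must not be included in the mean. For example, the first observation (index=0) gets np.NaN, because there are no observations other than the one the mean is being computed for. The second observation (index=1) gets 1, because 1/1 = 1 (the 0 from the second observation is excluded). The third observation (index=2) gets 0.5, because (1+0)/2 = 0.5.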
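The expected values can be reproduced with a plain loop. This is only a minimal sketch to make the arithmetic concrete: the helper name `prior_mean` is hypothetical, and the `hit` values are copied by hand from the input above.

```python
import math

# `hit` values copied from the input data, already in datetime order.
hits = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1]


def prior_mean(values):
    """Mean of all earlier values for each position; NaN when there are none."""
    out = []
    for i in range(len(values)):
        earlier = values[:i]  # slice stops before i, so the current value is excluded
        out.append(sum(earlier) / len(earlier) if earlier else math.nan)
    return out


result = prior_mean(hits)
print(result[:3])  # [nan, 1.0, 0.5]
```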
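My code gives the correct answer (numerically), but it is not elegant. I would like to know whether the exercise can be solved in a different way. Is it possible to use pandas.api.indexers.VariableOffsetWindowIndexer, or pandas.api.indexers.BaseIndexer together with its get_window_bounds() method?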
My solution:
def add_hr(df):
    """
    Generate a feature `mean_hr` which represents the average hit rate
    at the moment of making the offer (`datetime`).

    Parameters
    ----------
    df : pandas.DataFrame
        The `hit` column must be present. Ascending/descending order in the `datetime`
        column is not assumed.
            hit : int
            datetime : string (format='%Y-%m-%d %H:%M:%S')

    Returns
    ----------
    df_expanded : pandas.DataFrame
        A (deep) copy of the input pandas.DataFrame.
    """
    df_expanded = df.copy(deep=True)
    df_expanded.sort_values(by=['datetime'], ascending=True, inplace=True)
    df_expanded['mean_hr'] = df_expanded['hit'].expanding().mean()
    srs = df_expanded['mean_hr']
    srs = srs[:len(srs)-1]
    srs = pd.concat([pd.Series([np.nan]), srs])
    df_expanded['mean_hr'] = srs.tolist()
    return df_expanded
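Full disclaimer: this exercise was part of a hiring process a month ago. The hiring has since ended, and the code can no longer be submitted.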
It looks like the problem can be solved by subclassing the BaseIndexer class:
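from pandas.api.indexers import BaseIndexer

class CustomIndexer(BaseIndexer):
    # pandas 1.5+ also passes `step`, so it is included in the signature here.
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        # The window for row i is the half-open range [0, i): earlier rows only.
        start = np.zeros(num_values, dtype='int64')
        end = np.arange(0, num_values, dtype='int64')
        return start, end

indexer = CustomIndexer(window_size=0)
df_expanded = df.copy(deep=True)
# Aggregate only `hit`; the `datetime` column is not something to average.
df_expanded['mean_hr'] = df_expanded['hit'].rolling(indexer).mean()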
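To see why these bounds leave out the current row, it can help to call `get_window_bounds` directly and print the positions each window covers. This is a sketch rather than a reference implementation: the class name `PriorRowsIndexer` is made up for illustration, and the default arguments (including `step`) are an assumption so that the method can be called by hand.

```python
import numpy as np
from pandas.api.indexers import BaseIndexer


class PriorRowsIndexer(BaseIndexer):  # hypothetical name, same idea as the indexer above
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        # Row i aggregates the half-open slice [0, i), i.e. every earlier row.
        start = np.zeros(num_values, dtype="int64")
        end = np.arange(num_values, dtype="int64")
        return start, end


start, end = PriorRowsIndexer(window_size=0).get_window_bounds(num_values=4)
for i, (s, e) in enumerate(zip(start, end)):
    print(f"row {i}: positions {list(range(s, e))}")
```

Row 0 gets an empty window, which is why its mean comes out as NaN, and every later row sees only the positions before it.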
A simpler way to get what you want is to shift the expanding mean down by one row, like this:
df.sort_values(by=['datetime'], inplace=True)
# Select `hit` so that a single Series, not the whole DataFrame, is assigned.
df['mean_hit'] = df['hit'].expanding().mean().shift(1)
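If mutating `df` with an in-place sort is undesirable, one possible variation is to sort only while computing and let index alignment put the values back. The sketch below uses a small made-up frame (not the question's data) whose rows are deliberately out of time order; the column name `mean_hit` follows the snippet above.

```python
import pandas as pd

# Made-up example data, intentionally not sorted by time.
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2018-06-14 08:00", "2018-06-14 06:00", "2018-06-14 07:00"]
    ),
    "hit": [0, 1, 0],
})

# Sort by time, take the expanding mean of `hit`, then shift down one row so
# each row only sees strictly earlier observations.
prior_mean = df.sort_values("datetime")["hit"].expanding().mean().shift(1)

# Column assignment aligns on index labels, so the original row order is kept.
df["mean_hit"] = prior_mean
print(df)
```

The earliest row (06:00) ends up with NaN, the 07:00 row with 1.0, and the 08:00 row with 0.5, even though the frame itself was never reordered.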