Pandas - Rolling window - CustomIndex - right bound is not included in window for sum

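I have been exploring CustomIndexer, and I can see that 'end' (the right bound) is not included in the subsequent sum I want to compute.

This has two consequences:

To work around the first consequence, I resorted to including the next row, so that the window ends where I want it to end.

For the second one, however, I have no fallback.

Original code

So I tested the first custom window in a separate function to make debugging easier.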

import pandas as pd
import numpy as np

def custom_bounds(num_values, index, date_range):
    start = np.empty(num_values, dtype=np.int64)
    end = np.empty(num_values, dtype=np.int64)
    # Row timestamps keyed by integer position (0, 1, 2, ...)
    ind_as_int = index.to_series().reset_index(drop=True)
    dr_as_series = date_range.to_series()
    # 1st item is skipped and defaults to 0
    start[0] = 0
    end[0] = 0
    # Loop over the other items
    for i in range(1, num_values):
        # Last timestamp of date_range strictly before the current row
        previous_ts_in_dr = dr_as_series.loc[dr_as_series.index < ind_as_int.iat[i]].index[-1]
        # Position of the first row at or after that timestamp
        start[i] = ind_as_int.loc[ind_as_int >= previous_ts_in_dr].index[0]
        end[i] = i - 1
    return start, end

Example input data

I can test it with the following input values.

from random import seed
from random import randint

# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0,10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
df.index.name='Timestamp'

# Processing
dr = pd.date_range(start='2019-12-31 23:00+00:00', end='2020-01-03 00:00+00:00', freq='3h')

Running it:

In [20]: df.head(4)
Out[20]: 
                           Values
Timestamp                                   
2020-01-01 00:00:00+00:00       2
2020-01-01 01:00:00+00:00       9
2020-01-01 02:00:00+00:00       1
2020-01-01 03:00:00+00:00       4

Running the original code with the input data

start, end = custom_bounds(num_values=df.shape[0], index=df.index, date_range=dr)

df_2 = pd.DataFrame({'int' : df.reset_index().index,
                 'start' : start,
                 'end' : end},
                index = df.index)
df_2.loc[df_2.index.isin(dr), 'TS_3h'] = 'X'

So basically, in df_2, we can see the integers marking the start and end of the custom windows. Both bounds must be included in the rolling window. I am happy with the values shown below.

In [22]: df_2.head(6)
Out[22]: 
                           int  start  end TS_3h
Timestamp                                       
2020-01-01 00:00:00+00:00    0      0    0   NaN
2020-01-01 01:00:00+00:00    1      0    0   NaN
2020-01-01 02:00:00+00:00    2      0    1     X
2020-01-01 03:00:00+00:00    3      2    2   NaN
2020-01-01 04:00:00+00:00    4      2    3   NaN
2020-01-01 05:00:00+00:00    5      2    4     X
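As a cross-check, the same bounds can be reproduced without pandas. The sketch below is only an illustration: it assumes the timestamps are replaced by integer hours counted from 2020-01-01 00:00, that both lists are sorted, and that the range starts before the first row. The name `custom_bounds_sorted` is hypothetical. It uses `bisect_left` from the standard library instead of boolean masks, so each lookup is a binary search.

```python
from bisect import bisect_left


def custom_bounds_sorted(index_hours, range_hours):
    """Apply the same rule as custom_bounds to sorted lists of integer hours."""
    n = len(index_hours)
    start = [0] * n
    end = [0] * n
    for i in range(1, n):
        # Last range boundary strictly before the current row.
        # Assumes such a boundary exists (the range starts before the index).
        pos = bisect_left(range_hours, index_hours[i]) - 1
        previous_boundary = range_hours[pos]
        # Position of the first row at or after that boundary.
        start[i] = bisect_left(index_hours, previous_boundary)
        end[i] = i - 1
    return start, end


index_hours = list(range(0, 25))      # hourly rows, 00:00 to 24:00
range_hours = list(range(-1, 49, 3))  # 3-hourly boundaries from 23:00 the day before
start, end = custom_bounds_sorted(index_hours, range_hours)
print(start[:6])  # [0, 0, 0, 2, 2, 2]
print(end[:6])    # [0, 0, 1, 2, 3, 4]
```

The first six values agree with the 'start' and 'end' columns of df_2 above.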

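So I was confident about the next step. I expected to see the following sums:

Implementing the CustomIndexer & running it

So I integrated my code into a custom 'get_window_bounds()' as follows.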

from pandas.api.indexers import BaseIndexer


class CustomIndexer(BaseIndexer):

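    # Note: pandas >= 1.5 validates this signature and expects the trailing 'step' parameter.
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        # Row timestamps keyed by integer position (0, 1, 2, ...)
        ind_as_int = self.index.to_series().reset_index(drop=True)
        dr_as_series = self.date_range.to_series()
        # 1st item is skipped and defaults to 0
        start[0] = 0
        end[0] = 0
        # Loop over the other items
        for i in range(1, num_values):
            # Last timestamp of date_range strictly before the current row
            previous_ts_in_dr = dr_as_series.loc[dr_as_series.index < ind_as_int.iat[i]].index[-1]
            # Position of the first row at or after that timestamp
            start[i] = ind_as_int.loc[ind_as_int >= previous_ts_in_dr].index[0]
            end[i] = i - 1
        return start, end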

indexer = CustomIndexer(index=df.index, date_range=dr, closed='both')
# Roll over the 'Values' column explicitly so that a Series is assigned to 'Sum'
df['Sum'] = df['Values'].rolling(indexer).sum()
df.loc[df.index.isin(dr), 'TS_3h'] = 'X'

Running it:

In [25]: df.head(4)
Out[25]: 
                           Values  Sum TS_3h
Timestamp                                   
2020-01-01 00:00:00+00:00       2  0.0   NaN
2020-01-01 01:00:00+00:00       9  0.0   NaN
2020-01-01 02:00:00+00:00       1  2.0     X
2020-01-01 03:00:00+00:00       4  0.0   NaN

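As stated above, I would like to see the following result:

So the question is: how do I make sure the right bound is included in the computation of the sum?

Thanks for your help.

OK, solved by adjusting the indices. Sorry for the bother.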

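    # Note: pandas >= 1.5 validates this signature and expects the trailing 'step' parameter.
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        # Row timestamps keyed by integer position (0, 1, 2, ...)
        ind_as_int = self.index.to_series().reset_index(drop=True)
        dr_as_series = self.date_range.to_series()
        # Loop over all items
        for i in range(num_values):
            # Last timestamp of date_range strictly before the current row
            previous_ts_in_dr = dr_as_series.loc[dr_as_series.index < ind_as_int.iat[i]].index[-1]
            # Position of the first row at or after that timestamp
            start[i] = ind_as_int.loc[ind_as_int >= previous_ts_in_dr].index[0]
            # 'end' is exclusive: end[i] = i makes row i-1 the last row summed
            end[i] = i
        # Correct end[0] so that the first window contains the first row
        end[0] = 1
        return start, end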
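The reason this works, as far as I understand it, is that pandas treats each pair of bounds as a half-open slice, values[start[i]:end[i]], so 'end' itself is never summed. With the data in this post, the fix amounts to shifting every 'end' from the question by one. A minimal pandas-free sketch of that behaviour, using the first four rows shown above (the helper name `window_sums` is hypothetical):

```python
def window_sums(values, start, end):
    """Sum values[start[i]:end[i]] for each row, i.e. with an exclusive end."""
    return [float(sum(values[s:e])) for s, e in zip(start, end)]


values = [2, 9, 1, 4]  # first four rows of df['Values']
start = [0, 0, 0, 2]   # 'start' column of df_2
end = [0, 0, 1, 2]     # 'end' column of df_2

# Bounds as written in the question: 'end' is excluded.
exclusive = window_sums(values, start, end)
# Bounds shifted by one, so that the original 'end' row is included.
inclusive = window_sums(values, start, [e + 1 for e in end])

print(exclusive)  # [0.0, 0.0, 2.0, 0.0]
print(inclusive)  # [2.0, 2.0, 11.0, 1.0]
```

The first list matches the 'Sum' column printed in the question, which is consistent with the half-open interpretation.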