Initializing values for the first and last row following a resample operation?

Question

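For example, given a DataFrame indexed by 1h periods, I would like to set the values 0 and 1 in a new column at the start and end, respectively, of each new 5h period.

Let's take this input data as an example: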

import pandas as pd
from random import seed, randint
from collections import OrderedDict

p1h = pd.period_range(start='2020-02-01 00:00', end='2020-03-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0,10) for p in p1h]
df = pd.DataFrame({'Values' : values}, index=p1h)

Result

df.head(10)

                  Values
p1h                     
2020-02-01 00:00       2
2020-02-01 01:00       9
2020-02-01 02:00       1
2020-02-01 03:00       4
2020-02-01 04:00       1
2020-02-01 05:00       7
2020-02-01 06:00       7
2020-02-01 07:00       7
2020-02-01 08:00      10
2020-02-01 09:00       6

Is there a way to set a new column so as to get the result below? (The first and last rows of each period are initialized with 0 and 1, respectively.)

df['period5h'] = df.resample('5h').???

df.head(10)

                  Values   period5h
p1h                     
2020-02-01 00:00       2          0   <- 1st row of 5h period
2020-02-01 01:00       9
2020-02-01 02:00       1
2020-02-01 03:00       4
2020-02-01 04:00       1          1   <- last row of 5h period
2020-02-01 05:00       7          0   <- 1st row of 5h period
2020-02-01 06:00       7
2020-02-01 07:00       7
2020-02-01 08:00      10
2020-02-01 09:00       6          1   <- last row of 5h period

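Can this be done somehow with built-in pandas functionality?

The end goal is then to fill the empty values by linear interpolation between 0 and 1, so that each row gets its percentage progress within the current 5h period.

Another track / question

Another approach could be to initialize a second DataFrame with a 5h PeriodIndex, initialize the values of the new column to 1, then upsample the PeriodIndex back to 1h and merge the two DataFrames.

A shift(-1) would then place the 1 on the last row of each period.

I would repeat the process for the value 0, without the shift.

So how can I create this new DataFrame so that I can merge it into the first one? I tried a few merge commands, but got an error saying that the two indexes do not have the same frequency.

Thanks for your help! Best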

Answer 1

Not the most pythonic approach, but it works.

import time
from random import seed, randint

import pandas as pd

p1h = pd.period_range(start='2020-02-01 00:00', end='2040-03-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0, 10) for p in p1h]
df = pd.DataFrame({'Values': values}, index=p1h)

# Note: this overwrites 'Values' in place; target a separate column to keep the data.
col = df.columns.get_loc('Values')

t1 = time.time()
for i in range(len(df)):
    if i % 5 == 0:      # first row of each 5-row block
        df.iloc[i, col] = 0   # single .iloc call avoids chained assignment
    elif i % 5 == 4:    # last row of each 5-row block
        df.iloc[i, col] = 1
t2 = time.time()

print(df.head(20))
print(t2 - t1)

Time: 8.770591259002686

Method two:

import time
from random import seed, randint

import pandas as pd

p1h = pd.period_range(start='2020-02-01 00:00', end='2040-03-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0, 10) for p in p1h]
df = pd.DataFrame({'Values': values}, index=p1h)

# Note: this overwrites 'Values' in place; target a separate column to keep the data.
col = df.columns.get_loc('Values')

t1 = time.time()
df.iloc[0::5, col] = 0   # first row of each 5-row block
df.iloc[4::5, col] = 1   # last row of each 5-row block
t2 = time.time()

print(df.head(20))
print(t2 - t1)

Time: 0.009400367736816406
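
Since the stated end goal is the progress of each row within its 5h block, the same positional idea can give that number directly, without writing 0/1 markers first. This is only a sketch under the same assumptions as the two methods above: the index has no gaps, and the first row sits on a block boundary. The names `rows_per_block`, `pos_in_block` and `progress` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hourly frame; 10 rows cover exactly two 5h blocks.
p1h = pd.period_range(start='2020-02-01 00:00', periods=10, freq='1h', name='p1h')
df = pd.DataFrame({'Values': range(10)}, index=p1h)

rows_per_block = 5  # assumption: 5 consecutive 1h rows per 5h block, no gaps

# Position of each row inside its block: 0, 1, 2, 3, 4, 0, 1, ...
pos_in_block = np.arange(len(df)) % rows_per_block

# Scale to [0, 1]: first row of a block -> 0.0, last row -> 1.0
df['progress'] = pos_in_block / (rows_per_block - 1)

print(df)
```

If the data has gaps or irregular timestamps, this positional shortcut no longer matches the 5h bins, and a resample-based approach is the safer choice.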

Answer 2

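Use the indices attribute of the resample object to find the first and last index of each group. This works even if the data does not have a regular frequency, or if its frequency does not evenly divide the resample frequency. Groups that contain only a single measurement will be set to 1 rather than 0. We then set the values accordingly: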

i1 = []  # integer (.iloc) position of the last row in each group
i0 = []  # integer (.iloc) position of the first row in each group
for k, v in df.resample('5h').indices.items():
    i0.append(v[0])
    i1.append(v[-1])

df.loc[df.index[i0], 'period_5H'] = 0
df.loc[df.index[i1], 'period_5H'] = 1  # written last, so single-row groups end up as 1

                  Values  period_5H
p1h                                
2020-02-01 00:00       2        0.0
2020-02-01 01:00       9        NaN
2020-02-01 02:00       1        NaN
2020-02-01 03:00       4        NaN
2020-02-01 04:00       1        1.0
2020-02-01 05:00       7        0.0
2020-02-01 06:00       7        NaN
2020-02-01 07:00       7        NaN
2020-02-01 08:00      10        NaN
2020-02-01 09:00       6        1.0
2020-02-01 10:00       3        0.0
...
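
To illustrate the claim about irregular data, here is a small sketch on made-up timestamps. It assumes a DatetimeIndex instead of the PeriodIndex used in the question, and relies on the default resample bin origin (midnight of the first day). The data and the names `first_pos` and `last_pos` are hypothetical.

```python
import numpy as np
import pandas as pd

# Irregular timestamps (assumed data); the 05:00 and 10:00 bins hold one row each.
idx = pd.DatetimeIndex([
    '2020-02-01 00:00', '2020-02-01 01:00', '2020-02-01 03:30',
    '2020-02-01 05:00', '2020-02-01 11:00',
])
df = pd.DataFrame({'Values': [2, 9, 1, 7, 3]}, index=idx)

# Collect the integer positions of the first and last row of every 5h bin.
first_pos, last_pos = [], []
for positions in df.resample('5h').indices.values():
    first_pos.append(int(positions[0]))
    last_pos.append(int(positions[-1]))

df['period_5h'] = np.nan
col = df.columns.get_loc('period_5h')
df.iloc[first_pos, col] = 0.0
df.iloc[last_pos, col] = 1.0  # written second, so single-row bins end up as 1

print(df)
```

Here the first bin gets 0 on its first row and 1 on its last, while the two single-row bins both end up as 1, as described above.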

Answer 3

OK, I eventually settled on the following approach, which is fairly fast (no loop):

 super_pi = pd.period_range(start='2020-01-01 00:00', end='2020-06-01 00:00', freq='5h', name='p5h')
 super_df = pd.DataFrame({'End' : 1, 'Start' : 0}, index=super_pi).resample('1h').first()
 # We know last row is a 1 (end of period)
 super_df['End'] = super_df['End'].shift(-1, fill_value=1)
 super_df['Period'] = super_df[['End','Start']].sum(axis=1, min_count=1)

Result

 super_df.head(10)

                   End  Start  Period
 p5h                                 
 2020-01-01 00:00  NaN    0.0     0.0
 2020-01-01 01:00  NaN    NaN     NaN
 2020-01-01 02:00  NaN    NaN     NaN
 2020-01-01 03:00  NaN    NaN     NaN
 2020-01-01 04:00  1.0    NaN     1.0
 2020-01-01 05:00  NaN    0.0     0.0
 2020-01-01 06:00  NaN    NaN     NaN
 2020-01-01 07:00  NaN    NaN     NaN
 2020-01-01 08:00  NaN    NaN     NaN

Best,
