Calculating multi-year 5-day running percentile

Question

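I need to calculate 3-day running 90th percentile values for each calendar day from multi-year data. I have a 30-year daily dataset that looks like this: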

            year    month   day  value
DATE                    
01/01/1980  1980    1       1    12.3957
02/01/1980  1980    1       2    8.2678
03/01/1980  1980    1       3    11.9438
04/01/1980  1980    1       4    8.8035
05/01/1980  1980    1       5    2.749
...             ... ... ...
27/01/2010  2010    1       27   4.1186
28/01/2010  2010    1       28   5.9619
29/01/2010  2010    1       29   8.8146
30/01/2010  2010    1       30   12.9397
31/01/2010  2010    1       31   11.8427

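To calculate the 90th percentile value for 1 January, I have to select the 3-day window centered on 1 January in each of the 30 years. That gives me 90 (3*30) data points per day. I can then compute the percentile and record it as the 90th percentile value for that center day. I then repeat the process for every day by moving the 3-day window, until I have a new data frame filled with percentile values for each day from 1 January to 31 December.

The problem is that my datasets do not all use the same calendar: sometimes it is a standard calendar year (365/366 days), sometimes 365 days only, and sometimes 360 days (12 months * 30 days). I am removing the leap days, but I don't know in advance which dataset uses which calendar.

I tried iterating over the days, but I ran into problems when there was no 29 February or 31 January. I tried slicing on multiple conditions inside a for loop, but I hit the same problem.

I don't know how to select a 3-day moving window across the 30 years and calculate the percentile.

Any help would be appreciated!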

Answer 1

I have managed to solve this. First, I dropped 29 February, which leaves a dataset with either 365 or 360 days per year. Then I changed the datetime index to strings.

df.index = df.index.strftime('%m-%d')
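As a minimal sketch of both steps together, assuming `df` has a `DatetimeIndex` and a `value` column as in the question (the toy data below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two years of daily values; 1980 is a leap year.
idx = pd.date_range("1980-01-01", "1981-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# Drop 29 February so that every year has the same set of calendar days.
is_leap_day = (df.index.month == 2) & (df.index.day == 29)
df = df[~is_leap_day]

# Replace the datetime index with 'MM-DD' strings. These sort
# lexicographically in calendar order, so string comparisons such as
# df.index >= '12-30' behave like date comparisons within a year.
df.index = df.index.strftime("%m-%d")
```

After this, every row that falls on the same calendar day shares one index label regardless of its year, which is what lets a single mask pool all years at once.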

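I used enumerate on the unique index values to loop over all days. I used an if-else block to select the 5-day window centered on the first two and last two days of the year, where start would otherwise be greater than end.

For 1 January, this would be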

current = '01-01'
start = '12-30'
end = '01-03'

Therefore, I used both the or (|) and the and (&) operators to select the dates.

# 90th percentile for each 'MM-DD' calendar day
p90 = {}
# df.index[count] is the count-th calendar day only because the data are
# sorted, start on 01-01 and span more than one year: negative positions
# reach back into the end of the last year, and positions past the first
# year land in the start of the second year.
for count, date in enumerate(df.index.unique()):
    start = df.index[count-2]
    end = df.index[count+2]
    current = df.index[count]
    print("My date is %s, 2 days before is %s and 2 days later is %s" % (current, start, end))

    # day[0]: the two preceding days come from the end of the year
    if count == 0:
        temp = df.loc[((df.index >= current) & (df.index <= end)) | ((df.index >= df.index[count - 2]) & (df.index <= df.index[count - 1]))]
        p90[date] = temp.value.quantile(0.9)

    # day[1]: one preceding day comes from the end of the year
    elif count == 1:
        temp = df.loc[((df.index >= df.index[count - 1]) & (df.index <= end)) | ((df.index >= df.index[count - 2]) & (df.index <= df.index[count - 2]))]
        p90[date] = temp.value.quantile(0.9)

    # day[-2]: one following day comes from the start of the year
    elif count == len(df.index.unique()) - 2:
        temp = df.loc[((df.index >= df.index[count - 2]) & (df.index <= df.index[count + 1])) | ((df.index >= df.index[count + 2]) & (df.index <= df.index[count + 2]))]
        p90[date] = temp.value.quantile(0.9)

    # day[-1]: the two following days come from the start of the year
    elif count == len(df.index.unique()) - 1:
        temp = df.loc[((df.index >= df.index[count - 2]) & (df.index <= df.index[count])) | ((df.index >= df.index[count + 1]) & (df.index <= df.index[count + 2]))]
        p90[date] = temp.value.quantile(0.9)

    # day[2:-2]: the whole window lies inside the year
    else:
        temp = df.loc[(df.index >= start) & (df.index <= end)]
        p90[date] = temp.value.quantile(0.9)
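The same idea can probably be written more compactly. The sketch below is an alternative, not the code above: it swaps the range comparisons for membership tests (`Index.isin`) and wraps the window around the year end with modular arithmetic, so no special cases are needed. The names `running_percentile`, `half_width`, `q` and `toy` are hypothetical, and it assumes the index already holds 'MM-DD' strings and every year contains the same set of days.

```python
import pandas as pd

def running_percentile(df, half_width=2, q=0.9):
    """Percentile of `value` over a centered window of calendar days,
    pooled across all years.

    Assumes df.index holds 'MM-DD' strings and that every year
    contains the same set of days (leap days already dropped).
    """
    days = sorted(df.index.unique())  # calendar order, whatever the row order
    n = len(days)
    result = {}
    for i, day in enumerate(days):
        # Neighbouring day labels; (i + k) % n wraps around the year end.
        window = [days[(i + k) % n] for k in range(-half_width, half_width + 1)]
        result[day] = df.loc[df.index.isin(window), "value"].quantile(q)
    return pd.Series(result)

# Made-up toy data: three non-leap years, value = day of year.
idx = pd.date_range("1981-01-01", "1983-12-31", freq="D")
toy = pd.DataFrame(
    {"value": idx.dayofyear.to_numpy(dtype=float)},
    index=idx.strftime("%m-%d"),
)
result = running_percentile(toy)
```

Because the window is built from the day labels that actually occur in the data, this should behave the same way for 365-day and 360-day calendars. Setting `half_width=1` gives the 3-day window described in the question.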

python

indexing

datetime

percentile

pandas
