从不同年份开始并具有不同 window 长度的时间序列的移动平均线

Moving average along a time series starting at different years and with different window length

我有一个包含两个变量的数据框。 第一个变量是从 1 到 2016 连续跨越的年份,第二个变量是我的兴趣值。例如

ts <- data.frame(Year=c(1:2016), TS=sample(seq(from=-1.5, to=1.5,by=0.01), size=2016, replace=TRUE))

我需要计算从上次观察(2016-50=1966)前 50 年开始的 21 个移动平均值,每个都包含 window,范围从 10、11、12、...30 年,第一个值与 2016 年对齐。 我需要的是一个包含 23 列(“年”、“TS”、“av10”、“av11”、“av12”、...“av30”)和从 1 年到 2016 年的 2016 行的数据框。 因此,我这样做了:

subset1966 <- ts[1:1966,]
subset1966$Year.50 <- c(51:2016)

ts.50yrs <- subset1966%>%
  mutate(av10 = rollmean(TS, k = 10, fill = NA, align = "right"),
         av11 = rollmean(TS, k = 11, fill = NA, align = "right"),
         av12 = rollmean(TS, k = 12, fill = NA, align = "right"),
         av13 = rollmean(TS, k = 13, fill = NA, align = "right"),
         av14 = rollmean(TS, k = 14, fill = NA, align = "right"),
         av15 = rollmean(TS, k = 15, fill = NA, align = "right"),
         av16 = rollmean(TS, k = 16, fill = NA, align = "right"),
         av17 = rollmean(TS, k = 17, fill = NA, align = "right"),
         av18 = rollmean(TS, k = 18, fill = NA, align = "right"),
         av19 = rollmean(TS, k = 19, fill = NA, align = "right"),
         av20 = rollmean(TS, k = 20, fill = NA, align = "right"),
         av21 = rollmean(TS, k = 21, fill = NA, align = "right"),
         av22 = rollmean(TS, k = 22, fill = NA, align = "right"),
         av23 = rollmean(TS, k = 23, fill = NA, align = "right"),
         av24 = rollmean(TS, k = 24, fill = NA, align = "right"),
         av25 = rollmean(TS, k = 25, fill = NA, align = "right"),
         av26 = rollmean(TS, k = 26, fill = NA, align = "right"),
         av27 = rollmean(TS, k = 27, fill = NA, align = "right"),
         av28 = rollmean(TS, k = 28, fill = NA, align = "right"),
         av29 = rollmean(TS, k = 29, fill = NA, align = "right"),
         av30 = rollmean(TS, k = 30, fill = NA, align = "right"))

ts.50yrs.df <- ts.50yrs[,-c(1:2)]
colnames(ts.50yrs.df)[1] <- "Year"
df.ts.50 <- full_join(ts,ts.50yrs.df)

我在最后一次观察之前的 100、150、200 和 250 年开始的不同年份重复了相同的程序。 例如

subset1766 <- data[1:1766,]
subset1766$Year.250 <- c(251:2016)
ts.250yrs <- subset1766%>%
  mutate(av10 = rollmean(TS, k = 10, fill = NA, align = "right"),
         av11 = rollmean(TS, k = 11, fill = NA, align = "right"),
         av12 = rollmean(TS, k = 12, fill = NA, align = "right"),
         av13 = rollmean(TS, k = 13, fill = NA, align = "right"),
         av14 = rollmean(TS, k = 14, fill = NA, align = "right"),
         av15 = rollmean(TS, k = 15, fill = NA, align = "right"),
         av16 = rollmean(TS, k = 16, fill = NA, align = "right"),
         av17 = rollmean(TS, k = 17, fill = NA, align = "right"),
         av18 = rollmean(TS, k = 18, fill = NA, align = "right"),
         av19 = rollmean(TS, k = 19, fill = NA, align = "right"),
         av20 = rollmean(TS, k = 20, fill = NA, align = "right"),
         av21 = rollmean(TS, k = 21, fill = NA, align = "right"),
         av22 = rollmean(TS, k = 22, fill = NA, align = "right"),
         av23 = rollmean(TS, k = 23, fill = NA, align = "right"),
         av24 = rollmean(TS, k = 24, fill = NA, align = "right"),
         av25 = rollmean(TS, k = 25, fill = NA, align = "right"),
         av26 = rollmean(TS, k = 26, fill = NA, align = "right"),
         av27 = rollmean(TS, k = 27, fill = NA, align = "right"),
         av28 = rollmean(TS, k = 28, fill = NA, align = "right"),
         av29 = rollmean(TS, k = 29, fill = NA, align = "right"),
         av30 = rollmean(TS, k = 30, fill = NA, align = "right"))

ts.250yrs.df <- ts.250yrs[,-c(1:2)]
colnames(ts.250yrs.df)[1] <- "Year"
df.ts.250 <- full_join(ts,ts.50yrs.df)

然后我将数据框合并在一起。例如:

df <- cbind(df.ts.50, df.ts.100, df.ts.150, df.ts.200, df.ts.250)

但是,我接下来想做的是重复相同的移动平均线,起始年份从 50、51、52、...250 开始,这意味着从 1966 年 (2016-50) 开始,然后继续直到 1766 年 (2016-250),每个起始年的移动平均值范围为 10、11、12、...30 年。此外,所有值都需要与 2016 年保持一致。

所以,这将是一个包含 4221 列的数据框(201(= 50 到 250 年)* 21(= windows 从 10 到 30 年)),加上“年”的两列”和“TS”。列名称类似于“Year”、“TS”、“av50.10yr”、“av50.11yr”、“av50.12yr”、...、“av50.30yr”、“av51.10yr”、“ av51.11yr, "av51.12yr", ... "av51.30yr", ... "av250.10yr", "av250.11yr", "av250.12yr", ..., "av250.30yr" ) 和从第 1 年到 2016 年的 2016 行。

您可以将逻辑放入一个函数 makeAvg() 中,它仅需要年数并从 max(dat$year) 中减去,即 50 给出所需的 1:1966

# makeAvg <- \(s, ks=10:30, x='TS', data=dat) {
#   sbs <- 1:(max(data$Year) - s)
#   X <- data[sbs, ]
#   cbind(X, sapply(ks, \(k) zoo::rollmean(X[[x]], k=k, fill=NA, align="right")) |>
#           `colnames<-`(paste0('av', ks, '_', s)))
# }

(编辑:为避免速度变慢 zoo::rollmean,试试下面这个版本,它使用 RcppRoll::roll_mean 大约是 快20倍.)

makeAvg <- \(s, ks=10:30, x='TS', data=dat) {
  sbs <- 1:(max(data$Year) - s)
  X <- data[sbs, ]
  cbind(X, 
        sapply(ks, \(k) RcppRoll::roll_mean(X[[x]], n=k, fill=NA, align='r')) |>
          `colnames<-`(paste0('av', ks, '_', s)))
}

注:R版本>=4.1

然后抛出这个:

makeAvg(50)[8:13, ]
#    Year    TS av10_50    av11_50    av12_50    av13_50 av14_50 av15_50 av16_50
# 8     8 -0.23      NA         NA         NA         NA      NA      NA      NA
# 9     9 -1.27      NA         NA         NA         NA      NA      NA      NA
# 10   10 -0.62  -0.448         NA         NA         NA      NA      NA      NA
# 11   11  0.14  -0.332 -0.3945455         NA         NA      NA      NA      NA
# 12   12 -0.41  -0.375 -0.3390909 -0.3958333         NA      NA      NA      NA
# 13   13 -1.31  -0.429 -0.4600000 -0.4200000 -0.4661538      NA      NA      NA
#    av17_50 av18_50 av19_50 av20_50 av21_50 av22_50 av23_50 av24_50 av25_50 av26_50
# 8       NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
# 9       NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
# 10      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
# 11      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
# 12      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
# 13      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
#    av27_50 av28_50 av29_50 av30_50
# 8       NA      NA      NA      NA
# 9       NA      NA      NA      NA
# 10      NA      NA      NA      NA
# 11      NA      NA      NA      NA
# 12      NA      NA      NA      NA
# 13      NA      NA      NA      NA

然后把starts放到一个向量中,

starts <- seq(50, 250, 1)

你使用 lapply 在其上循环,然后用 merge():

将其放入 Reduce()
res <- Reduce(\(...) merge(..., all=TRUE), lapply(starts, makeAvg))  ## runs ~25 s

这给出了一个完全符合预期尺寸的数据框,

dim(res)
# [1] 1966 4223

还有这些名字:

names(res)
# [1] "Year"     "TS"       "av10_50"  "av11_50"  "av12_50"  "av13_50"  "av14_50" 
# [8] "av15_50"  "av16_50"  "av17_50"  "av18_50"  "av19_50"  "av20_50"  "av21_50" 
# [...]

数据:

set.seed(42)
dat <- data.frame(Year=c(1:2016),
                  TS=sample(seq(from=-1.5, to=1.5,by=0.01), size=2016, replace=TRUE))

在这里,我创建了一个自定义函数 moving_mean(),它获取数据框和起始年份。然后,我使用 mutate 创建一个新列,其起始年份序列(例如 50)到 2016 年。然后,我创建了一个名称列表,这些名称将用于数据框中的列名称(例如,av50.10yr).然后,我使用 map2mutate 滚动年数的新列(例如,10、11、...、30)。然后,我使用 purrr::reduce 将给定起始年份(例如 50)的所有数据帧合并为一个。接下来,我在原始数据帧 ts 上从 50 到 250 的每个起始年份使用 map 到 运行 函数 moving_mean。然后,我将 201 个数据帧放入同一个数据帧中(再次使用 reduce)。最后,我将这个数据框加入了原始数据框,它向 ts.

添加了 4,221 个新列
library(tidyverse)

moving_mean <- function(data, y){
  new_year <- paste0("Year.", as.character(y))
  
  x_new <- data %>% 
    filter(Year <= (tail(Year, n=1) - y)) %>% 
    rowwise %>% 
    dplyr::mutate({{new_year}} := (Year + y)) %>% 
    ungroup()
  
  varnames <- unlist(map(10:30, function(x) paste0("av", y, ".", x, "yr")))
  
  map2(10:30, varnames, function(x, y) x_new %>% 
         dplyr::mutate({{y}} := rollmean(TS, k = x, fill = NA, align = "right")) %>% 
         as.data.frame()) %>% 
    reduce(left_join, by = c(names(x_new)[1:3])) %>% 
    select(-c("Year", "TS")) %>% 
    dplyr::rename(Year = 1)
}

results <- map(c(seq(50, 250, 1)), ~ moving_mean(ts, .x)) %>% 
  reduce(left_join, by = "Year") %>% 
  left_join(ts, ., by = "Year")

输出(这里只打印一小部分)

dim(results)
# [1] 2016 4223

results[60:70, 1:15]

   Year    TS av50.10yr   av50.11yr  av50.12yr  av50.13yr av50.14yr av50.15yr av50.16yr  av50.17yr  av50.18yr av50.19yr av50.20yr av50.21yr av50.22yr
60   60  0.63    -0.080          NA         NA         NA        NA        NA        NA         NA         NA        NA        NA        NA        NA
61   61 -1.50     0.123 -0.01727273         NA         NA        NA        NA        NA         NA         NA        NA        NA        NA        NA
62   62 -0.07     0.295  0.15545455 0.02416667         NA        NA        NA        NA         NA         NA        NA        NA        NA        NA
63   63 -0.84     0.338  0.30000000 0.17166667 0.04923077        NA        NA        NA         NA         NA        NA        NA        NA        NA
64   64 -0.36     0.317  0.39363636 0.35416667 0.23153846 0.1135714        NA        NA         NA         NA        NA        NA        NA        NA
65   65  0.34     0.442  0.29000000 0.36250000 0.32846154 0.2164286 0.1073333        NA         NA         NA        NA        NA        NA        NA
66   66 -0.45     0.312  0.35363636 0.22166667 0.29384615 0.2671429 0.1666667  0.067500         NA         NA        NA        NA        NA        NA
67   67 -1.17     0.415  0.25000000 0.29333333 0.17615385 0.2464286 0.2246667  0.133125 0.04176471         NA        NA        NA        NA        NA
68   68 -0.77     0.355  0.42363636 0.27166667 0.31000000 0.2000000 0.2640000  0.242500 0.15529412 0.06777778        NA        NA        NA        NA
69   69  0.51     0.477  0.45090909 0.50583333 0.35923077 0.3885714 0.2806667  0.335625 0.31117647 0.22500000 0.1384211        NA        NA        NA
70   70 -0.25     0.331  0.42272727 0.40333333 0.45769231 0.3250000 0.3546667  0.255625 0.30882353 0.28722222 0.2068421    0.1255        NA        NA

数据

set.seed(28)
ts <- data.frame(Year = c(1:2016),
                  TS = sample(
                    seq(from = -1.5, to = 1.5, by = 0.01),
                    size = 2016,
                    replace = TRUE
                  ))