Simulating a time series in dplyr instead of using a for loop

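So, while lag and lead in dplyr are great, I want to simulate a time series of something like population growth. My old-school code would look something like: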

tdf <- data.frame(time=1:5, pop=50)
for(i in 2:5){
  tdf$pop[i] = 1.1*tdf$pop[i-1]
}

which produces

  time    pop
1    1 50.000
2    2 55.000
3    3 60.500
4    4 66.550
5    5 73.205

I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).

But something like

tdf <- data.frame(time=1:5, pop=50) %>%
  mutate(pop = 1.1*lag(pop))

which would have been my first guess, just produces

  time pop
1    1  NA
2    2  55
3    3  55
4    4  55
5    5  55

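I feel like I'm missing something obvious... what is it?

Note: this is a trivial example. My real example uses multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so the tidyverse is proving to be a powerful tool for bringing my simulations together.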

If the starting value of pop is 50, then pop = 50 * 1.1^(0:4) gives you the starting value followed by the next four values. With your code, you could do:

data.frame(time=1:5, pop=50) %>%
  mutate(pop = pop * 1.1^(1:n() - 1))

Or,

base = 50

data.frame(time=1:5) %>%
  mutate(pop = base * 1.1^(1:n()-1))

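What about the map functions, i.e.

tdf <- tibble(time = 1:5)  # tibble() replaces the deprecated data_frame()
# time starts at 1, so the exponent is x - 1 to keep pop = 50 at the first step
tdf %>% mutate(pop = map_dbl(time, function(x) 50 * 1.1^(x - 1)))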

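The problem here is that dplyr runs this as a set of vector operations, rather than evaluating one term at a time. Here, 1.1*lag(pop) is interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop=50 for every row, every lagged value is 50 (apart from the first, which is NA because nothing precedes it).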
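To make the vectorised behaviour concrete, here is a minimal sketch that runs the two steps separately on a plain vector. The variable names lagged and grown are only illustrative.

```r
library(dplyr)

pop <- rep(50, 5)

# lag() shifts the whole vector at once; it never sees values computed "so far"
lagged <- dplyr::lag(pop)
print(lagged)  # NA 50 50 50 50

# The multiplication is then applied to every lagged value in one go
grown <- 1.1 * lagged
print(grown)   # NA 55 55 55 55
```

Nothing in this pipeline feeds the result for row 2 back in before row 3 is computed, which is why the for loop and the mutate() call disagree.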

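dplyr does have some helper functions for sequential evaluation: the standard functions cumsum, cumprod, etc. work, and a few new ones (see ?cummean) all work within dplyr. In your example, you could simulate the model with:

tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1,times=4))) %>%
    mutate(pop = pop*cumprod(growth_rate))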


time    pop     growth_rate
1       50.000  1.0
2       55.000  1.1
3       60.500  1.1
4       66.550  1.1
5       73.205  1.1

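Note that I added a growth rate column here and set the first growth rate to 1. You could also specify it like this:

tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
    mutate(pop = pop*cumprod(lag(growth_rate,default=1)))

With lag(), the growth rate in each row is applied to the step from that time point to the next one, so the first row keeps the starting population.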

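There are limits to how many different simulations you can do this way, but it should be feasible to build a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.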
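As a rough illustration of the time-varying case, the sketch below uses made-up growth rates (the values in growth_rate are hypothetical, not from the question) and compares the cumprod() result with the equivalent for loop.

```r
library(dplyr)

# Hypothetical per-step growth rates. growth_rate[t] is the growth from
# time t-1 to time t, so the first entry is 1 (nothing precedes the start).
sim <- data.frame(time = 1:5, growth_rate = c(1, 1.1, 1.2, 0.9, 1.05)) %>%
  mutate(pop = 50 * cumprod(growth_rate))

# The equivalent for loop, for comparison
loop_pop <- numeric(5)
loop_pop[1] <- 50
for (i in 2:5) {
  loop_pop[i] <- sim$growth_rate[i] * loop_pop[i - 1]
}

# Both approaches should agree up to floating-point error
print(all.equal(sim$pop, loop_pop))
```

This only works because the model is purely multiplicative; once the update depends on the current population in a nonlinear way, cumprod() is no longer enough.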

Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version:

data.frame(time = 1:5, pop = 50) %>%
    mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))

##   time    pop
## 1    1 50.000
## 2    2 55.000
## 3    3 60.500
## 4    4 66.550
## 5    5 73.205

or with purrr,

data.frame(time = 1:5, pop = 50) %>%
    mutate(pop = accumulate(pop, ~.x * 1.1))

##   time    pop
## 1    1 50.000
## 2    2 55.000
## 3    3 60.500
## 4    4 66.550
## 5    5 73.205
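If a parameter changes at each step, one possible approach is to reduce over the parameter vector and supply the starting population as the initial value. The sketch below is an assumption-laden illustration: the growth values, the comp constant and the ricker_step helper are all hypothetical, and the update rule is a simple density-dependent one that cumprod() could not express.

```r
# Hypothetical time-varying growth rates for steps 1->2, 2->3, 3->4, 4->5
growth <- c(0.5, 0.6, 0.4, 0.5)
comp <- 0.05

# One update step: next population from the current population and this step's rate
ricker_step <- function(pop, r) pop * exp(r - comp * pop)

# Reduce() with init threads the population through each rate in turn,
# returning the starting value plus one value per growth rate
pop <- Reduce(ricker_step, growth, init = 1, accumulate = TRUE)
print(pop)
```

The same call can be written with purrr as accumulate(growth, ricker_step, .init = 1).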

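Purrr's accumulate function can handle time-varying indices, if you pass them to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick here is that accumulate() can work on list columns as well as vector columns. You can use the tidyr function nest() to group columns into a list vector containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo, simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this to simulate multiple replicates of a given model using dplyr+purrr+tidyr.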

library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)

# Declare the population growth function. accumulate() calls it with two
# arguments: .x, the state returned by the previous step, and .y, the
# current row of parameters (plus a placeholder population).
# This example function is a Ricker population growth model.
logistic_growth = function(.x, .y) {
  pop = .x$pop[1]
  growth = .y$growth[1]
  comp  = .y$comp[1]
  # Note: this uses the state from .x, and the parameter values from .y.
  # The first observation will use the first entry in the vector for .x and .y
  new_pop = pop*exp(growth - pop*comp)
  .y$pop[1] = new_pop
  return(.y)
}

# Starting parameters: the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps  = 100
pop_init = 1
growth = 0.5
comp = 0.05

# First test: fixed growth rates
test1 = tibble(time = 1:n_steps, pop = pop_init,
               growth = growth, comp = comp)


# Here, nest() packs the state and parameter columns into a list column called
# state, holding one single-row data frame per time point. accumulate() then
# runs the function along that list column, and unnest() transforms it all
# back to a data frame.
out1 = test1 %>%
  nest(state = c(pop, growth, comp)) %>%
  mutate(state = accumulate(state, logistic_growth)) %>%
  unnest(state)

# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = tibble(time = 1:n_steps, pop = pop_init,
               growth = rnorm(n_steps, growth, 0.1), comp = comp)

out2 = test2 %>%
  nest(state = c(pop, growth, comp)) %>%
  mutate(state = accumulate(state, logistic_growth)) %>%
  unnest(state)

# This demonstrates how to use this approach to simulate replicates using dplyr.
# Note the crossing function creates all combinations of its input values.
test3 = crossing(rep = 1:10, time = 1:n_steps, pop = pop_init, comp = comp) %>%
  mutate(growth = rnorm(n(), growth, 0.1))

# Nest one state per (rep, time) row, then accumulate within each replicate
out3 = test3 %>%
  nest(state = c(pop, growth, comp)) %>%
  group_by(rep) %>%
  mutate(state = accumulate(state, logistic_growth)) %>%
  ungroup() %>%
  unnest(state)

# ggplot() replaces the deprecated qplot()
print(ggplot(out1, aes(time, pop)) +
        geom_point() +
        geom_line() +
        geom_point(data = out2, col = "red") +
        geom_line(data = out2, col = "red") +
        geom_point(data = out3, col = "red", alpha = 0.1) +
        geom_line(data = out3, col = "red", alpha = 0.1, aes(group = rep)))