Equivalent of Stata command `simulate` in R for Monte Carlo simulation

Question

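I am looking for an R equivalent of the extremely convenient Stata command simulate. The command basically lets you declare a program (reg_simulation in the example below), then call that program from simulate and store the desired output.

Below is a Stata illustration of how simulate is used, followed by my attempt to replicate it in R.

Finally, my main question is: how would R users run a Monte Carlo simulation? Or am I missing something in terms of structure or speed bottlenecks? Thank you very much.

Stata example

Defining the reg_simulation program.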

clear all
*Define "reg_simulation" to be used later on by "simulate" command 
program reg_simulation, rclass
    *Declaring Stata version
    version 13
    *Dropping all variables in memory
    drop _all
    *Set sample size (n=100)
    set obs 100
    *Simulate model
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
    *Estimate OLS
    reg y x1 x2 
    *Store coefficients
    matrix B = e(b)
    return matrix betas = B 
end

Calling reg_simulation from the simulate command:

*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation

Results obtained (data stored in memory):

_b_x1   _b_x2   _b_cons
.4470155    1.50748     1.043514
.4235979    1.60144     1.048863
.5006762    1.362679    .8828927
.5319981    1.494726    1.103693
.4926634    1.476443    .8611253
.5920001    1.557737    .8391003
.5893909    1.384571    1.312495
.4721891    1.37305     1.017576
.7109139    1.47294     1.055216
.4197589    1.442816    .9404677

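R replication of the Stata program above.

Using R, I have managed to get the following (I am not an R expert). However, the part that worries me most is the for-loop structure that loops over each of the nreps repetitions.

Defining the reg_simulation function.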

#Defining a function 
reg_simulation<- function(obs = 1000){
    data <- data.frame(
    #Generate data
    x1 <-rnorm(obs, 0 , 1) ,
    x2 <-rnorm(obs, 0 , 1) ,
    y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1) )
  #Estimate OLS
  ols <- lm(y ~ x1 + x2, data=data)  
  return(ols$coefficients)  
}

Calling reg_simulation 10 times using a for-loop structure:

#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
  #Set seed internally (to get different values in each run)
  set.seed(i)
  #Save results into list
  results_list[i]  <- list(reg_simulation(obs=1000))  
}
#unlist results
df_results<- data.frame(t(sapply(results_list, 
                       function(x) x[1:max(lengths(results_list))])))

Results obtained: df_results.

#final results
df_results
#   X.Intercept.  x1        x2
# 1     1.0162384 0.5490488 1.522017
# 2     1.0663263 0.4989537 1.496758
# 3     0.9862365 0.5144083 1.462388
# 4     1.0137042 0.4767466 1.551139
# 5     0.9996164 0.5020535 1.489724
# 6     1.0351182 0.4372447 1.444495
# 7     0.9975050 0.4809259 1.525741
# 8     1.0286192 0.5253288 1.491966
# 9     1.0107962 0.4659812 1.505793
# 10    0.9765663 0.5317318 1.501162

Answer 1

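So, following the comments, you want to vary the independent variables (x) as well as the error term and simulate the coefficients, but you also want to catch errors if any occur. The following does the trick: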

set.seed(42)
#Defining a function 
reg_simulation <- function(obs = 1000){
    #Generate data (same draw order as before: x1, x2, then the error term)
    x1 <- rnorm(obs, 0, 1)
    x2 <- rnorm(obs, 0, 1)
    y <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1)
    data <- data.frame(x1 = x1, x2 = x2, y = y)

    #Estimate OLS; on error, return NA for every coefficient
    tryCatch(
      {
        ols <- lm(y ~ x1 + x2, data = data)
        return(ols$coefficients)
      },
      error = function(e){
        return(c('(Intercept)' = NA, 'x1' = NA, 'x2' = NA))
      }
    )
}
output <- t(data.frame(replicate(10, reg_simulation())))
output

    (Intercept)        x1       x2
X1    0.9961328 0.4782010 1.481712
X2    1.0234698 0.4801982 1.556393
X3    1.0336289 0.5239380 1.435468
X4    0.9796523 0.5095907 1.493548
...

Here, tryCatch (see also failwith) catches errors and returns NA as the default value.
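To see the fallback in action, here is a minimal sketch. The helper name safe_coef is hypothetical (it is not part of the answer above), and the zero-row data frame is just one convenient way to make lm() fail.

```r
# Hypothetical helper: fit a model, fall back to NA coefficients on error.
safe_coef <- function(formula, data) {
  tryCatch(
    coef(lm(formula, data = data)),
    error = function(e) c("(Intercept)" = NA_real_, x1 = NA_real_, x2 = NA_real_)
  )
}

set.seed(1)
good <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
good$y <- 1 + 0.5 * good$x1 + 1.5 * good$x2 + rnorm(50)

ok  <- safe_coef(y ~ x1 + x2, good)        # fit succeeds: estimates are returned
bad <- safe_coef(y ~ x1 + x2, good[0, ])   # zero rows: lm() errors, NAs are returned

ok
bad
```

Because both branches return a vector with the same names, the results can still be stacked into one table even when some replications fail.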

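Note that you only need to set the seed once, because the random number generator's state changes automatically, in a deterministic way, every time the generator is called.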
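A small sketch of that behaviour, using only base R (the variable names are just for illustration):

```r
set.seed(42)
first  <- rnorm(3)
second <- rnorm(3)        # differs from `first`: the RNG state has advanced

set.seed(42)
first_again <- rnorm(3)   # same seed, same starting state, same draws

identical(first, first_again)  # TRUE
identical(first, second)       # FALSE
```

So a single set.seed() before the whole simulation is enough to make every replication different from the others while keeping the full run reproducible.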

Answer 2

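You're on the right track. A couple of hints/corrections:

1. Don't use <- inside data.frame()

In R, we construct data frames using = for the internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).

(There is more to be said about <- vs. =, but I don't want to distract you from the main question.)

In your case, you don't actually even need to create a data frame, since x1, x2 and y will all be recognized within the function's scope. I'll post some code at the end of my answer demonstrating this.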
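A quick sketch of the difference, with toy values chosen purely for illustration:

```r
# `=` names the columns
good <- data.frame(x = 1:3, y = 4:6)
names(good)   # "x" "y"

# `<-` assigns x and y in the calling environment as a side effect,
# and the columns get names derived from the whole expression instead
bad <- data.frame(x <- 1:3, y <- 4:6)
names(bad)    # not "x" and "y"
exists("x")   # TRUE: x leaked into the workspace
```

That side effect is also why the original function happened to work: lm() found x1, x2 and y as ordinary variables, not as columns of the data frame.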

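2. When growing a list via a for loop in R, always try to pre-allocate the list first

If you are going to grow a list via a (long) for loop, always try to pre-allocate the list length and type. Reason: that way, R knows how much memory to allocate for your object up front. In your case, where you're only doing 10 reps, that would mean starting with something like this: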

results_list <- vector("list", 10)
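As a rough sketch of how the pre-allocated list then gets filled (the body of the loop is a placeholder standing in for one simulation run):

```r
nreps <- 10
results_list <- vector("list", nreps)   # 10 empty (NULL) slots

for (i in seq_len(nreps)) {
  # Placeholder computation; a real run would call the simulation function here
  results_list[[i]] <- c(rep = i, value = mean(rnorm(100)))
}

length(results_list)
```

Note the double brackets: results_list[[i]] <- value stores the value itself in slot i, so there is no need to wrap it in list() first.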

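3. Consider using lapply instead of for

for loops have a bit of a bad rep in R. (Somewhat unfairly, but that is a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply. I'll hold off on showing you the code for a second, but it looks very similar to a for loop. Just to note quickly, following on from point 2, one immediate benefit is that you don't need to pre-allocate the list with lapply.

4. Run large loops in parallel

A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is through the future.apply package.

Putting everything together, here is how I would probably do your simulation. Note that this might be more "advanced" than you need, but since I'm here...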

library(data.table)   ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession)    ## use all available cores

obs <- 1e3

# Defining a function 
reg_simulation <- function(...){
    x1 <- rnorm(obs, 0 , 1)
    x2 <- rnorm(obs, 0 , 1)
    y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
    #Estimate OLS
    ols <- lm(y ~ x1 + x2)  
    
    # return(ols$coefficients)
    return(as.data.frame(t(ols$coefficients)))
}

# N repetitions
nreps <- 10

## Serial version
# results  <- lapply(1:nreps, reg_simulation)

## Parallel version
results  <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)

## Unlist / convert into a data.table
results <- rbindlist(results)
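If you would rather avoid extra packages, the same idea can be sketched in base R. The wrapper below, simulate_r, is a hypothetical helper that only loosely mirrors Stata's simulate, reps(); it is not an existing R function, and reg_once is a stand-in for your own simulation function.

```r
# Hypothetical helper: run `fun` `reps` times and stack the results row-wise.
simulate_r <- function(reps, fun, seed = NULL, ...) {
  if (!is.null(seed)) set.seed(seed)    # seed once, before all replications
  runs <- lapply(seq_len(reps), function(i) fun(...))
  as.data.frame(do.call(rbind, runs))   # one row per replication
}

# One replication: simulate data, fit OLS, return the named coefficients.
reg_once <- function(obs = 100) {
  x1 <- rnorm(obs)
  x2 <- rnorm(obs)
  y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs)
  coef(lm(y ~ x1 + x2))
}

sims <- simulate_r(reps = 10, fun = reg_once, seed = 1234, obs = 100)
dim(sims)        # 10 rows, 3 columns
colMeans(sims)   # averages should be near 1, 0.5 and 1.5
```

The numbers will not match the Stata output even with the same seed value, since the two programs use different random number generators.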
