broom::augment too slow with panel mixed model
When I fit a mixed panel-data model and then use broom::augment to retrieve the full regression output, it is very slow compared with the other models (random effects, fixed effects, etc.).
Example:
#Load packages
library(foreign)
library(plm)
library(stargazer)
library(dplyr)
library(tidyverse)
library(quantreg)
library(lme4)  # needed for lmer() below
#Load in Wooldridge data on crime
crime <- read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/crime4.dta")
#Declare our data to be a panel data set
crime.p <- pdata.frame(crime,index=c("county","year"))
crime.p %>% head
panel1 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "within")
panel2 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "random")
panel3 <- plm(crmrte ~ polpc + prbconv + avgsen + density, data = crime.p, model = "pool")
panel4 <- rq(crmrte ~ polpc + prbconv + avgsen + density + factor(county)-1, data = crime.p)
panel5 <- lmer(crmrte ~ polpc + prbconv + avgsen + density + (1 | county), data = crime.p)
broom::augment(panel1)
broom::augment(panel2)
broom::augment(panel4)
# This one is very slow
broom::augment(panel5)
This is already slow, and on a larger dataset it can take as long as 20 minutes. Is there a way to retrieve the broom::augment(mixed) output faster?
broom::augment(panel5) is not particularly slow on my machine (~0.01 s). You can see where most of the time goes by computing each column separately and timing it.
I created a custom function that computes each element of the augment result:
fe <- function(panel){
  f <- fixef(panel)
  int <- f[1]
  coef <- f[-1]
  d <- panel@frame       # use the function argument, not the global panel5
  d1 <- d[names(coef)]   # keep only the predictor columns, in coefficient order
  int + rowSums(mapply(`*`, d1, coef))
}
augment2 <- function(panel) {
  d <- panel@frame
  p <- predict(panel)
  r <- resid(panel)
  hat <- hatvalues(panel)
  cook <- cooks.distance(panel)
  fe <- fe(panel)
  mu <- panel@resp$mu
  wr <- panel@resp$wtres
  cbind(d, fitted = p, residual = r, hat, cooksd = cook, fixed = fe, mu, wtres = wr)
}
If we time the two, they are very similar, since all I am doing is computing the same quantities that augment computes:
microbenchmark::microbenchmark(
augment = broom::augment(panel5),
cust = augment2(panel5)
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# augment 12.332313 12.85186 16.83977 13.31599 15.69942 59.84045 100
# cust 9.976812 10.38016 11.94285 10.66314 11.32606 43.50708 100
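To see which component dominates, each quantity can also be benchmarked on its own. This is a sketch under the assumption that panel5 is the lmer fit from above; all of these are standard lme4/stats methods for merMod objects:

```r
# Time each augment component separately to find the bottleneck
microbenchmark::microbenchmark(
  predict = predict(panel5),
  resid   = resid(panel5),
  hat     = hatvalues(panel5),
  cooksd  = cooks.distance(panel5),
  times = 20
)
```

On a fit like this, the hatvalues and cooks.distance rows should dominate the table, which motivates dropping them below.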
If we drop each column in turn from the custom augment2 function, we can see that most of the time is spent computing hatvalues and cooks.distance. If you don't actually need one or both of those columns, it speeds up the computation significantly. Excluding both, we see roughly a 25x speedup:
augment3 <- function(panel) {
  d <- panel@frame
  p <- predict(panel)
  r <- resid(panel)
  #hat <- hatvalues(panel)
  #cook <- cooks.distance(panel)
  fe <- fe(panel)
  mu <- panel@resp$mu
  wr <- panel@resp$wtres
  cbind(d, fitted = p, residual = r, fixed = fe, mu, wtres = wr)
}
microbenchmark::microbenchmark(
augment = broom::augment(panel5),
cust = augment3(panel5)
)
#Unit: microseconds
# expr min lq mean median uq max neval
# augment 12549.486 12924.666 17760.754 13341.712 15480.9555 87276.359 100
# cust 406.818 474.119 694.698 532.005 586.5685 4702.398 100
Essentially, the time is spent in cooks.distance and hatvalues, and I can't imagine they can be computed much more efficiently than they already are.
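As a quick sanity check (a sketch, assuming panel5 is the lmer fit from above), the shared columns of the custom output can be compared with broom's; broom's augment returns a .fitted column for merMod fits, which should line up with the fitted column produced by the custom function:

```r
# Compare the fast custom output with broom's on the columns they share
a1 <- broom::augment(panel5)
a2 <- augment3(panel5)
all.equal(unname(a1$.fitted), unname(a2$fitted))
```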