具有许多 'fixed effects' / 个别特定截距的大型面板的良好包装

Question

我在 plm() 这样的小数据集中取得了很好的经验：

library("data.table")
library("plm")
set.seed(123)
smalldata <- data.table(id = rep(1:100, each = 10), x = rnorm(1000))
smalldata[, y := x + 0.3*x + (rnorm(1000)/100)]

plm(smalldata,
    formula = y ~ x,
    effect = c("individual"),
    model = c("within"),
    index = "id")

但是如果我有一个更大的数据集，我很快就会运行进入 ram 问题并且事情变得非常（！）慢

set.seed(123)
largedata <- data.table(id = rep(1:100000, each = 10), x = rnorm(1000000))
largedata[, y := x + 0.3*x + (rnorm(1000000)/100)]

time_pre <- Sys.time()
plm(largedata,
    formula = y ~ x,
    effect = c("individual"),
    model = c("within"),
    index = "id")
time_post <- Sys.time()
time_post-time_pre

这在我的机器上已经花费了超过一分钟的时间，而且我现实生活中的数据集远大于 100 万个观察值。如果您想尝试一下，只需添加几个 0，然后等到您的 RAM 填满并且代码不再运行为止。当然，随着更多级别的固定效应的添加，问题变得更糟（您可以在 plm 中使用 as.factor() 虚拟回归来做到这一点）等 pp 但问题已经在这个简单的示例中可见。

这种缓慢也很奇怪，因为 plm() 文档告诉我它在应该很快的转换中使用。他们一定是使用复杂的函数来做到这一点，因为这里是同一个 'fixed effects' 面板，在转换中使用 data.table() 然后 lm():

time_pre_dt <- Sys.time()
largedata[, x_demeaned := x - mean(x), by = id]
largedata[, y_demeaned := y - mean(y), by = id]
lm(largedata, formula = y_demeaned ~ -1 + x_demeaned)
time_post_dt <- Sys.time()
time_post_dt - time_pre_dt
as.double(time_post-time_pre, unit = "secs")/as.double(time_post_dt - time_pre_dt, unit = "secs")

在我的机器上，速度提高了 70-125 倍。

所以更快面板的成分已经存在，但 plm() 似乎没有利用它们。是否还有其他面板包对大型 'fixed effects' 面板更有用？我对解决方法犹豫不决（比如在转换中），因为我并不总是确定在不同版本的 coeftest() 等中如何计算标准误差（Stata 中的相同问题，样本调整等）。我正在寻找开发的软件包，并希望避免从头开始编写所有代码。

编辑：

为了使用下面@Helix123 的答案进行基准测试，这里尝试使用 plm() 'fast':

install.packages("collapse")
install.packages("fixest")
install.packages("lfe")

# (restart R session)

options("plm.fast" = TRUE)

time_pre_fast <- Sys.time()
plm(largedata,
    formula = y ~ x,
    effect = c("individual"),
    model = c("within"),
    index = "id")
time_post_fast <- Sys.time()

time_post_fast - time_pre_fast

as.double(time_post_fast - time_pre_fast, unit = "secs")/as.double(time_post_dt - time_pre_dt, unit = "secs")

内部转换示例在我的机器上仍然快 90 倍。然而，felm() 确实大大加快了速度！这里的例子：

library("lfe")

time_pre_felm <- Sys.time()
felm(largedata, formula = y ~ x | id)
time_post_felm <- Sys.time()

as.double(time_post_felm - time_pre_felm, unit = "secs")/as.double(time_post_dt - time_pre_dt, unit = "secs")

这只需要 data.table() 内部转换的两倍，然后是 lm()

feols() 最固定的包是赢家，请参阅下面的答案。

Answer 1

edit: plm 版本 2.6-0 默认启用快速模式。如果还安装了软件包 fixest（或 lfe），则可以进一步加快 plm 的运行速度。对于 ?plm.fast 中的基准测试示例，这提供了高达 28 倍的加速。

旧答案 自 plm 2.4-0 版本以来，有一个可选的快速模式，请参阅 2.4-0 和 2.4-2 的新闻条目：https://cran.r-project.org/web/packages/plm/news/news.html

要启用快速模式，请设置此选项：

options("plm.fast" = TRUE)

您可以在脚本或 .Rprofile 文件中手动设置此选项。

您还需要安装包 collapse，可选包 fixest 或 lfe 以在某些情况下进一步加快速度，尤其是。从 plm 2.4-2 开始的双向固定效应。

有关基准示例，请参阅 ?plm.fast 的帮助。

在您的示例中，创建 pdata.frame（plm 使用的数据格式）花费了大量时间，如果未首先显式转换数据，它会在对 plm 的调用中隐式发生。因此，可以将格式拆分为 pdata.frame 和模型估计。因此，plm() 的执行时间明显更短。此外，这将加快数据集上第一个模型之后的任何进一步模型估计。

# convert data set to a pdata.frame:
plargedata <- pdata.frame(largedata, index = "id")

# use the pdata.frame for model estimation:
plm(plargedata,
     formula = y ~ x,
     effect = c("individual"),
     model = c("within"))

您还可以查看软件包 fixest 和 lfe 以获得固定效应估计；它们的语法略有不同。如果您不想更改代码，plm 的快速模式应该可以帮助您。但是，如果您有两个以上的固定效应，则需要将它们包含在 plm 的公式中 +factor(fe3) + factor(fe4) + ... 并且这些情况可能会更快 lfe 和 fixest 如果有很多具有多个级别的固定效果。

Answer 2

根据@Helix123 的提示查看 fixst 包，答案如下：

library("fixest")

time_pre_fixest <- Sys.time()
feols(y ~ x | id, largedata)
time_post_fixest <- Sys.time()

as.double(time_post_fixest - time_pre_fixest, unit = "secs")/as.double(time_post_dt - time_pre_dt, unit = "secs")

用我问题中最快版本的十分之一。

编辑：当然，您仍然可以运行解决 RAM 问题，但这种可能性要小得多。

具有许多 'fixed effects' / 个别特定截距的大型面板的良好包装

Good packages for large panels with many 'fixed effects' / individual specific intercepts

r

data.table

plm