How to obtain predicted probabilities (or average marginal effects) for individual fixed effects in panel data using R?
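Below are three different ways of running an individual fixed-effects model; they give more or less the same results. My main question is how to get predicted probabilities or average marginal effects using the second model (model_plm) or the third model (model_felm). I know how to do it with the first model (model_lm), and I show an example using ggeffects below, but that only works when I have a small sample.
Because I have over a million individuals, my model only works with model_plm and model_felm. If I use model_lm, it takes a very long time to run, because each of the million individuals is controlled for in the model through factor(nr). I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on Stack Overflow about that error, but none of them resolved it.
I would like to know whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities for the interaction residence*union. I usually extract predicted probabilities or average marginal effects with one of these packages: ggeffects, emmeans or margins.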
library(lfe)
library(plm)
library(ggeffects)
data("Males")
model_lm = lm(wage ~ exper + residence + health + residence*union + factor(nr) - 1, data = Males)
model_plm = plm(wage ~ exper + residence + health + residence*union,
                model = "within", index = c("nr", "year"), data = Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data = Males)
pred_ggeffects <- ggpredict(model_lm, c("residence", "union"),
                            vcov.fun = "vcovCL",
                            vcov.type = "HC1",
                            vcov.args = list(cluster = Males$nr))
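As background on why the three fits agree, here is a minimal base-R sketch on simulated data. It assumes the usual textbook equivalence between the dummy-variable (LSDV) fit and the within (demeaned) fit; the toy data frame, its column names and the demean() helper are made up for illustration and are not part of the question. The within route never builds one dummy column per individual, which is presumably why model_plm and model_felm remain feasible with a million individuals.

```r
# Toy panel: 50 individuals observed 6 times each (all values simulated)
set.seed(1)
n_id <- 50
n_t  <- 6
toy <- data.frame(
  id = factor(rep(seq_len(n_id), each = n_t)),
  x  = rnorm(n_id * n_t),
  d  = as.numeric(rbinom(n_id * n_t, 1, 0.4))
)
alpha <- rnorm(n_id)[as.integer(toy$id)]   # individual-specific effects
toy$y <- 1 + 0.5 * toy$x - 0.3 * toy$d + alpha + rnorm(nrow(toy))

# (1) LSDV: one dummy per individual, as in model_lm
fit_lsdv <- lm(y ~ x + d + factor(id) - 1, data = toy)

# (2) Within: subtract each individual's mean, then OLS without intercept
demean <- function(v, g) v - ave(v, g)     # hypothetical helper
toy_w <- data.frame(y = demean(toy$y, toy$id),
                    x = demean(toy$x, toy$id),
                    d = demean(toy$d, toy$id))
fit_within <- lm(y ~ x + d - 1, data = toy_w)

# The slope estimates agree up to floating-point error
cbind(lsdv = coef(fit_lsdv)[c("x", "d")], within = coef(fit_within))
```

Only the slopes are shared between the two fits; the within fit carries no estimate of the individual intercepts.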
This potential solution uses biglm::biglm() to fit the lm model and then uses emmeans::qdrg() with a nuisance factor specified. Does this approach help in your situation?
library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels has NAs.
## so restrict data to complete cases in the `biglm()` call
model_biglm <- biglm(wage ~ -1 +exper + residence+health + residence*union + factor(nr), data=Males[!is.na(Males$residence),])
summary(model_biglm)
## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical! but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")
## do a "[q]uick and [d]irty [r]eference [g]rid" and follow examples
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html
rg1 <- qdrg(wage ~ -1 + exper + residence + health + residence*union + factor(nr),
            data = Males,
            coef = coef(model_biglm),
            vcov = vcov(model_biglm),
            df = model_biglm$df.resid,
            nuisance = "nr")
## Since we already specified nuisance in qdrg() we don't in emmeans():
emmeans(rg1, c("residence","union"))
This gives:
> emmeans(rg1, c("residence","union"))
 residence       union emmean     SE   df lower.CL upper.CL
 rural_area      no      1.72 0.1417 2677     1.44     2.00
 north_east      no      1.67 0.0616 2677     1.55     1.79
 nothern_central no      1.53 0.0397 2677     1.45     1.61
 south           no      1.60 0.0386 2677     1.52     1.68
 rural_area      yes     1.63 0.2011 2677     1.23     2.02
 north_east      yes     1.72 0.0651 2677     1.60     1.85
 nothern_central yes     1.67 0.0503 2677     1.57     1.77
 south           yes     1.68 0.0460 2677     1.59     1.77
Results are averaged over the levels of: 1 nuisance factors, health
Confidence level used: 0.95
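I tried adjusting the formula and datasets to get emmeans and plm to play together more nicely. Let me know if there is something here. After some testing, I realized that the biglm answer is not going to cut it for a million individuals.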
library(emmeans)
library(plm)
data("Males")
## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr"), data=Males)
## expanded dataset
Males2 <- data.frame(wage = Males[complete.cases(Males), "wage"],
                     model.matrix(wage ~ exper + residence + health + residence*union, Males),
                     nr = Males[complete.cases(Males), "nr"])
(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse = "+"))))
## expanded formula
model_plm2 <- plm(fmla2,
                  model = "within",
                  index = c("nr"),
                  data = Males2)
(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse = "+"))))
plm2_rg <- qdrg(fmla2_rg,
                data = Males2,
                coef = coef(model_plm2),
                vcov = vcov(model_plm2),
                df = model_plm2$df.residual)
plm2_rg
### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
which, after deleting some rows, gives:
> ### when all 3 residences are 0, that's `rural area`
> ### then just pick the rows when one of the residences are 1
> emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
 residencenorth_east residencenothern_central residencesouth unionyes emmean     SE   df lower.CL upper.CL
                   0                        0              0        0 0.3777 0.0335 2677  0.31201    0.443
                   1                        0              0        0 0.3301 0.1636 2677  0.00929    0.651
                   0                        1              0        0 0.1924 0.1483 2677 -0.09834    0.483
                   0                        0              1        0 0.2596 0.1514 2677 -0.03732    0.557
                   0                        0              0        1 0.2875 0.1473 2677 -0.00144    0.576
                   1                        0              0        1 0.3845 0.1647 2677  0.06155    0.708
                   0                        1              0        1 0.3326 0.1539 2677  0.03091    0.634
                   0                        0              1        1 0.3411 0.1534 2677  0.04024    0.642
Results are averaged over the levels of: healthyes
Confidence level used: 0.95
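The problem seems to be that when we add -1 to the formula, an extra column is created in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way R creates factor codings.)
So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix in the same way: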
library(emmeans)
library(plm)
data("Males")
mod <- plm(wage ~ exper + residence + health + residence*union,
           model = "within",
           index = "nr",
           data = Males)
BB <- c(coef(mod)[1], 0, coef(mod)[-1])
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)
RG <- qdrg(~ -1 + exper + residence + health + residence*union,
           data = Males, coef = BB, vcov = VV, df = df.residual(mod))
Verify that everything lines up:
> names(RG@bhat)
[1] "exper" ""
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
[1] "exper" "residencerural_area"
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
They do line up, so we can get the results we need:
(EMM <- emmeans(RG, ~ residence * union))
 residence       union emmean     SE   df lower.CL upper.CL
 rural_area      no     0.378 0.0335 2677  0.31201    0.443
 north_east      no     0.330 0.1636 2677  0.00929    0.651
 nothern_central no     0.192 0.1483 2677 -0.09834    0.483
 south           no     0.260 0.1514 2677 -0.03732    0.557
 rural_area      yes    0.287 0.1473 2677 -0.00144    0.576
 north_east      yes    0.385 0.1647 2677  0.06155    0.708
 nothern_central yes    0.333 0.1539 2677  0.03091    0.634
 south           yes    0.341 0.1534 2677  0.04024    0.642
Results are averaged over the levels of: health
Confidence level used: 0.95
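In general, the key is to identify where the added column occurs. It will be at the position of the first level of the first factor in the model formula. You can check it by comparing names(coef(mod)) with colnames(model.matrix(formula, data = data)), where formula is the model formula with the intercept removed.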
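The check described above can be sketched in base R without fitting anything. This is only an illustration on made-up data: the toy data frame and the variables region, member and x are hypothetical, and coef_names stands in for names(coef(mod)) under the assumption that a within model reports the intercept-coded column names minus the intercept.

```r
# Toy data: a four-level factor (the first factor in the formula),
# a two-level factor, and a numeric covariate
toy <- expand.grid(region = c("a", "b", "c", "d"),
                   member = c("no", "yes"),
                   rep    = 1:3)
toy$x <- seq_len(nrow(toy))

form_int   <- ~ x + region * member        # formula as fitted
form_noint <- ~ -1 + x + region * member   # same formula, intercept removed

cols_int   <- colnames(model.matrix(form_int, data = toy))
cols_noint <- colnames(model.matrix(form_noint, data = toy))

# Stand-in for names(coef(mod)) from a within model (an assumption)
coef_names <- setdiff(cols_int, "(Intercept)")

# The column that has no matching coefficient, and where it sits
extra <- setdiff(cols_noint, coef_names)
pos   <- match(extra, cols_noint)
extra
pos
```

Here the unmatched column is the first level of region, the first factor in the formula, which is the slot that receives the zero coefficient.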
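Update: a general function
Here is a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models), so we have to check. For models lacking an intercept, you really should use this only for contrasts.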
plmrg = function(object, ...) {
    form = formula(formula(object))
    if (!("(Intercept)" %in% names(coef(object))))
        form = update(form, ~ . - 1)
    data = eval(object$call$data, environment(form))
    mmat = model.matrix(form, data)
    sel = which(colnames(mmat) %in% names(coef(object)))
    k = ncol(mmat)
    b = rep(0, k)
    b[sel] = coef(object)
    v = matrix(0, nrow = k, ncol = k)
    v[sel, sel] = vcov(object)
    emmeans::qdrg(formula = form, data = data,
                  coef = b, vcov = v, df = df.residual(object), ...)
}
Test run:
> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
exper = 3, 6, 9
residence = rural_area, north_east, nothern_central, south
health = no, yes
union = no, yes
> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
 residence       emmean     SE   df lower.CL upper.CL
 rural_area       0.313 0.0791 2677   0.1579    0.468
 north_east       0.338 0.1625 2677   0.0190    0.656
 nothern_central  0.243 0.1494 2677  -0.0501    0.536
 south            0.281 0.1514 2677  -0.0161    0.578
Results are averaged over the levels of: exper, health, union
Confidence level used: 0.95
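The earlier caveat about using intercept-free models only for contrasts can be illustrated with a small base-R sketch. It uses simulated data and a hand-rolled within fit instead of plm and emmeans, so every name here (toy, region, member, the shift of 10) is an assumption made for illustration. The idea: the zero-padded coefficient vector leaves out the unidentified overall level, so the cell values are only defined up to a common constant, whereas differences between cells are not affected.

```r
set.seed(2)
n_id <- 40
n_t  <- 5
toy <- data.frame(id = rep(seq_len(n_id), each = n_t))
toy$region <- factor(sample(c("a", "b", "c"), nrow(toy), replace = TRUE))
toy$member <- factor(sample(c("no", "yes"), nrow(toy), replace = TRUE))
toy$y <- rnorm(n_id)[toy$id] + 0.4 * (toy$member == "yes") + rnorm(nrow(toy))

# Within fit: demean the intercept-coded design columns and the response
X  <- model.matrix(~ region * member, data = toy)[, -1]
Xw <- apply(X, 2, function(v) v - ave(v, toy$id))
yw <- toy$y - ave(toy$y, toy$id)
b_fit <- setNames(coef(lm(yw ~ Xw - 1)), colnames(X))

# Reference grid for the no-intercept formula, padded with a zero coefficient
grid <- expand.grid(region = levels(toy$region), member = levels(toy$member))
L <- model.matrix(~ -1 + region * member, data = grid)
b <- setNames(rep(0, ncol(L)), colnames(L))
b[names(b_fit)] <- b_fit
cells <- drop(L %*% b)

# Shift every region main effect by an arbitrary constant, standing in
# for the overall level that the within model cannot estimate
b_shift <- b
main <- grep("^region[abc]$", names(b))
b_shift[main] <- b[main] + 10
cells_shift <- drop(L %*% b_shift)

# Cell levels move by 10, the yes-minus-no gap within each region does not
gap       <- cells[4:6] - cells[1:3]
gap_shift <- cells_shift[4:6] - cells_shift[1:3]
rbind(gap, gap_shift)
```

The same reasoning suggests reading the emmean column above as relative values and reporting differences between cells instead.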