随机拦截 GLM

Question

我想在 R 中拟合随机截距互补对数对数回归，以检查未观察到的用户异质性。我通过互联网和书籍进行了搜索，但在 Stata 中只找到了一个解决方案，也许有人可以将其改编为 R。在 Stata 中有 2 个命令可用：

xtcloglog 用于二级随机拦截
gllamm 对于随机系数和更高级别的模型

我的数据与人们的活动是否完成以及受阳光影响有关 - completion 是结果变量，sunshine 和下面提到的其他变量是解释变量；这是一个简化版本。

    581755 obs. of 68 variables:
     $ activity          : int  37033 37033 37033 37033 37033 37033 37033 37033 37033 37033 ...
     $ people         : int  5272 5272 5272 5272 5272 5272 5272 5272 5272 5272 ...
     $ completion: num 0 0 0 0 0 0 0 0 0 0 ...
     $ active            : int  0 2 2 2 2 2 2 2 2 2 ...
     $ overdue           : int  0 0 0 0 0 0 0 0 0 0 ...
     $ wdsp              : num  5.7 5.7 7.7 6.4 3.9 5.8 3.5 6.3 4.8 9.4 ...
     $ rain              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ UserCompletionRate: num [1:581755, 1] NaN -1.55 -1.55 -1.55 -1.55 ...
      ..- attr(*, "scaled:center")= num 0.462
      ..- attr(*, "scaled:scale")= num 0.298
     $ DayofWeekSu       : num  0 0 0 0 0 1 0 0 0 0 ...
     $ DayofWeekMo       : num  0 0 0 0 0 0 1 0 0 0 ...
     $ DayofWeekTu       : num  1 0 0 0 0 0 0 1 0 0 ...
     $ DayofWeekWe       : num  0 1 0 0 0 0 0 0 1 0 ...
     $ DayofWeekTh       : num  0 0 1 0 0 0 0 0 0 1 ...
     $ DayofWeekFr       : num  0 0 0 1 0 0 0 0 0 0 ...

     $ MonthofYearJan    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearFeb    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMar    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearApr    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMay    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearJun    : num  1 1 1 1 1 1 1 1 1 1 ...
     $ MonthofYearJul    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearAug    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearSep    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearOct    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearNov    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ cold              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ hot               : num  0 0 0 0 0 0 0 0 0 0 ...
     $ overduetask       : num  0 0 0 0 0 0 0 0 0 0 ...

原始（简化）数据：

 df <-  people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)

到目前为止，我已将此代码用于 cloglog：

model <- as.formula("completion ~  sunshine")
clog_full = glm(model,data=df,family = binomial(link = cloglog))
summary(clog_full)

使用包 glmmML：

model_re <- as.formula("completion ~  sunshine")
clog_re = glmmML(model_re,cluster = people, data= df,
    family = binomial(link = cloglog)) 
summary(clog_re)

使用包 lme4:

model_re1<- as.formula("completion ~  (1|people) + sunshine") 
clog_re1 <- glmer(model_re1, data=df,
   family = binomial(link = cloglog)) 
summary(clog_re1)

然而，R 并没有从中得到任何结果，只是运行s 它们但从未得到结果。我是否必须将人员或活动用作集群？

如果有人也对如何运行这个模型有固定截距有想法，我很高兴知道。

Answer 1

这对我来说很好（或多或少：见下面的注释）

## added data.frame()
df <-  data.frame(people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)
        )

model_re1 <- completion ~  (1|people) + sunshine
clog_re1 <- glmer(model_re1, data=df,
                  family = binomial(link = cloglog))

这很快就完成了（不到一秒）：也许您忘记了关闭引号或括号之类的...？
但是，它确实会产生一条消息 "boundary (singular) fit: see ?isSingular"，这是因为您的数据集是如此 small/noisy，以至于人与人之间的差异的最佳估计值为零（因为它不能为负).

更新：很遗憾地告诉您，混合模型 (GLMM) 的计算强度明显高于标准 GLM：具有 68 个预测变量的 500K 个观测值绝对是一个大问题，你应该预计适合需要几个小时。我有几点建议：

您绝对应该尝试数据的子集（观测值和预测变量）以了解计算时间将如何扩展
据我所知，glmmTMB 包是 R 中解决此问题最快的选项（lme4 会严重扩展大量预测变量），但可能运行进入内存限制。 MixedModels.jl Julia 包可能更快。
您通常可以打开一个 "verbose" 或 "tracing" 选项，让您知道该模型至少在解决问题，而不是完全卡住（知道如何做并不是真的可行它需要很长时间才能完成，但至少你知道发生了什么......）
如果 Stata 快得多（我怀疑，但有可能）你可以使用它

这是一个包含 10,000 个观测值和单个预测变量的示例。

n <- 10000
set.seed(101)
dd <- data.frame(people=factor(rep(1:10000,each=3)),
                 sunshine=sample(1:8,size=n, replace=TRUE))
dd$completion <- simulate(~(1|people)+sunshine,
                          newdata=dd,
                          family=binomial(link="cloglog"),
                          newparams=list(beta=c(0,1),
                                         theta=1))[[1]]

glmer 运行s 80 秒然后失败：

system.time(update(clog_re1, data=dd, verbose=100))

另一方面，glmmTMB在大约 20 秒内解决了这个问题（我的计算机上有 8 个内核，glmmTMB 使用了所有内核，所以 CPU 分配这项工作的比例上升到 750%；如果您的内核较少，经过的计算时间将相应增加）。

library(glmmTMB)
system.time(glmmTMB(model_re1,data=dd,family = binomial(link = cloglog),
                    verbose=TRUE))

如果我再添加四个预测变量（总共 5 个），计算时间将增加到 46 秒（再次使用 8 个核心：所有核心的总计算时间为 320 秒）。预测变量的数量是原来的 13 倍，观测值的数量是原来的 50 倍，您绝对应该预料到这是一项具有挑战性的计算。

非常异质性的粗略评估是拟合同质模型并将残差偏差（或 Pearson 残差平方和）与模型的残差自由度进行比较;如果前者大得多，那就是某种形式的不匹配（异质性或其他原因）的证据。

随机拦截 GLM

Random Intercept GLM

random

r

fixed

intercept

lme4