Reduce glmer model size
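I'm new to R and I'm using glmer to fit several binomial models. I only need them so that I can call predict and use the resulting probabilities. However, I have a very large dataset, and even a single model object becomes very large: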
> library(pryr)
> object_size(mod)
701 MB
By comparison, the model coefficients are tiny:
> object_size(coef(mod))
1.16 MB
The same goes for the fitted values:
> object_size(fitted(mod))
25.6 MB
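First, I don't understand why the model object is so large. It seems to contain the original data frame used to fit the model, but even that doesn't account for the size. Why is it so big?
Second, is it possible to strip the model down to only the parts needed to call predict? If so, how would I go about it? I found a post at http://blog.yhathq.com/posts/reducing-your-r-memory-footprint-by-7000x.html that does this for glm, but glmer models seem to be accessed differently and have different components.
Any help would be appreciated.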
Edit:
Digging into the internals of the model:
> object_size(getME(mod, "X"))
205 MB
> object_size(getME(mod, "Z"))
36.9 MB
> object_size(getME(mod, "Zt"))
38.4 MB
> object_size(getME(mod, "Ztlist"))
41.6 MB
> object_size(getME(mod, "mmList"))
38.4 MB
> object_size(getME(mod, "y"))
3.2 MB
> object_size(getME(mod, "mu"))
3.2 MB
> object_size(getME(mod, "u"))
18.4 kB
> object_size(getME(mod, "b"))
19.5 kB
> object_size(getME(mod, "Gp"))
56 B
> object_size(getME(mod, "Tp"))
472 B
> object_size(getME(mod, "L"))
15.5 MB
> object_size(getME(mod, "Lambda"))
38.1 kB
> object_size(getME(mod, "Lambdat"))
38.1 kB
> object_size(getME(mod, "Lind"))
9.22 kB
> object_size(getME(mod, "Tlist"))
936 B
> object_size(getME(mod, "A"))
38.4 MB
> object_size(getME(mod, "RX"))
30.3 kB
> object_size(getME(mod, "RZX"))
1.05 MB
> object_size(getME(mod, "sigma"))
48 B
> object_size(getME(mod, "flist"))
4.89 MB
> object_size(getME(mod, "fixef"))
4.5 kB
> object_size(getME(mod, "beta"))
496 B
> object_size(getME(mod, "theta"))
472 B
> object_size(getME(mod, "ST"))
936 B
> object_size(getME(mod, "REML"))
48 B
> object_size(getME(mod, "is_REML"))
48 B
> object_size(getME(mod, "n_rtrms"))
48 B
> object_size(getME(mod, "n_rfacs"))
48 B
> object_size(getME(mod, "N"))
256 B
> object_size(getME(mod, "n"))
256 B
> object_size(getME(mod, "p"))
256 B
> object_size(getME(mod, "q"))
256 B
> object_size(getME(mod, "p_i"))
408 B
> object_size(getME(mod, "l_i"))
408 B
> object_size(getME(mod, "q_i"))
408 B
> object_size(getME(mod, "mod"))
48 B
> object_size(getME(mod, "m_i"))
424 B
> object_size(getME(mod, "m"))
48 B
> object_size(getME(mod, "cnms"))
624 B
> object_size(getME(mod, "devcomp"))
2.21 kB
> object_size(getME(mod, "offset"))
3.2 MB
> get_obj_size(mod@resp, "RC")
[,1]
family 673355488
initialize 673355488
initialize#lmResp 673355488
ptr 673355488
resDev 673355488
updateMu 673355488
updateWts 673355488
wrss 673355488
eta 3196024
mu 3196024
n 3196024
offset 3196024
sqrtrwt 3196024
sqrtXwt 3196024
weights 3196024
wtres 3196024
y 3196024
Ptr 40
> get_obj_size(mod@pp, "RC")
[,1]
beta 449419408
initialize 449419408
initializePtr 449419408
ldL2 449419408
ldRX2 449419408
linPred 449419408
ptr 449419408
setTheta 449419408
sqrL 449419408
u 449419408
X 204549128
V 182171288
Ut 38448168
Zt 38448168
LamtUt 38353248
Xwts 3196024
RZX 1047176
Lambdat 38136
VtV 26192
delu 18408
u0 18408
Utr 18408
Lind 9224
beta0 496
delb 496
Vtr 496
theta 72
Ptr 40
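Are you concerned about storage space or RAM? If it's about storage, one option is to embed the model-fitting call in the code that generates the predictions, so that you never actually store the model object. Something like:
## 'g' and 'dat' are placeholders; glmer needs at least one random-effects term
predictions <- predict(glmer(y ~ x + (1 | g), data = dat, family = binomial), type = "response")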
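To make the "never store the model" idea concrete, here is a minimal sketch that wraps the fit and the prediction in one function, so the fitted object only lives inside the function call. The helper name `predict_only` is made up for illustration, and the `cbpp` data shipped with lme4 stands in for the real dataset. Note that this only helps with storage: the full model still has to fit in RAM while it is being estimated.

```r
library(lme4)

# Hypothetical helper: fit a binomial GLMM and return only the predicted
# probabilities. The fitted model is local to the function, so it becomes
# eligible for garbage collection as soon as the function returns.
predict_only <- function(formula, data, ...) {
  fit <- glmer(formula, data = data, family = binomial, ...)
  predict(fit, type = "response")
}

# Example with the cbpp data from lme4 (an assumption; substitute your own data)
probs <- predict_only(cbind(incidence, size - incidence) ~ period + (1 | herd),
                      data = cbpp)
length(probs)
range(probs)
```

If predictions are needed for new data later on, this approach means refitting, so it is a trade of computation time for storage.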
Posting this as an incomplete answer for now:
library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
library("pryr")
object_size(gm1) ## 505 kB
Listing and extracting fields, following Steve Walker's S3/S4/Reference class dictionary:
get_obj_size <- function(obj,type="S4") {
fields <- switch(type,
S4=slotNames(obj),
RC=ls(obj))
get_field <- switch(type,
S4=function(x) slot(obj,x),
RC=function(x) obj[[x]])
field_list <- setNames(lapply(fields,get_field),fields)
cbind(sort(sapply(field_list,object_size),decreasing=TRUE))
}
get_obj_size(gm1)
## [,1]
## resp 356620 ## 'response module'
## pp 355420 ## 'predictor module'
## frame 6640
## optinfo 1748
## devcomp 1424
## call 1244
## flist 1232
## cnms 224
## u 152
## beta 56
## Gp 32
## lower 32
## theta 32
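It's worth drilling down further into the response and predictor modules to see what is in there and what is big. One caveat/complication: some of the information is stored in the environments of the components.
For example, I think all of the components below that nominally have the same size are not actually independent, but share the same environment...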
get_obj_size(gm1@resp,"RC")
## [,1]
## initialize 356620
## initialize#lmResp 356620
## ptr 356620
## resDev 356620
## setOffset 356620
## updateMu 356620
## updateWts 356620
## wrss 356620
## family 26016
## eta 472
## mu 472
## n 472
## offset 472
## sqrtrwt 472
## sqrtXwt 472
## weights 472
## wtres 472
## y 472
## Ptr 20
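The shared-environment effect can be reproduced in base R without lme4. The sketch below is only an analogy (the names `make_module`, `get_sum` and `get_mean` are invented for illustration): two closures are created in the same function call, so both point at one environment that holds a large vector. A size tool that follows environments, as pryr's object_size appears to do in the listing above, would charge each closure for the whole shared environment, while utils::object.size ignores a function's environment entirely.

```r
# Two closures created in the same call share one enclosing environment,
# which holds a large vector (about 8 MB for 1e6 doubles).
make_module <- function(n) {
  big <- numeric(n)
  list(
    get_sum  = function() sum(big),
    get_mean = function() mean(big)
  )
}

m <- make_module(1e6)

# Both closures point at the very same environment, not at two copies.
same_env <- identical(environment(m$get_sum), environment(m$get_mean))

# utils::object.size() does not count a function's environment,
# so the closure itself looks small ...
closure_bytes <- as.numeric(object.size(m$get_sum))

# ... while the data it can reach is stored once, in the shared environment.
big_bytes <- as.numeric(object.size(environment(m$get_sum)$big))

print(same_env)
print(c(closure = closure_bytes, big = big_bytes))
```

This suggests reading the repeated identical sizes in the listing as one block of memory reported several times, not as memory that adds up.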
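Another way to see which components are stored is to use object_size(getME(model, component)) and loop over the components listed by eval(formals(getME)$name). This doesn't correspond exactly to how the information is stored internally, but it will give you an idea of how much space is needed to hold (e.g.) the fixed-effect or random-effect model matrices.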
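A sketch of that loop, under a few assumptions: it uses the small `cbpp` example model, not the large model from the question; it measures with utils::object.size so that pryr is not required; and it looks the component names up on the merMod method when one is registered, because in recent lme4 versions getME appears to be an S3 generic whose default `name` vector lives on the method and not on the generic. Components that cannot be extracted for this model are recorded as NA.

```r
library(lme4)

gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# Component names: prefer the merMod method's defaults if getME() is an
# S3 generic (assumed for recent lme4), else fall back to getME() itself.
meth  <- getS3method("getME", "merMod", optional = TRUE)
comps <- eval(formals(if (is.null(meth)) getME else meth)$name)

# Size in bytes of each extracted component; NA if extraction fails.
sizes <- vapply(comps, function(nm) {
  tryCatch(as.numeric(object.size(getME(gm1, nm))),
           error = function(e) NA_real_)
}, numeric(1))

# Largest components first (sort() drops the NA entries)
head(sort(sizes, decreasing = TRUE), 10)
```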
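I did some more work on this and got a partial solution, but there is still a lot of storage that I can't seem to find or trim properly (note that this requires the latest version of lme4 from Github: I had to modify the predict function slightly to loosen its dependence on the internal structure).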
glmer_chop <- function(object) {
newobj <- object
newobj@frame <- model.frame(object)[0,]
newobj@pp <- with(object@pp,
new("merPredD",
Lambdat=Lambdat,
Lind=Lind,
theta=theta,
u=u,u0=u0,
n=nrow(X),
X=matrix(1,nrow=nrow(X)),
Zt=Zt)) ## .sparseDiagonal(n,shape="g")))
newobj@resp <- new("glmResp",family=binomial(),y=numeric(0))
return(newobj)
}
library("mlmRev") ## assumed source of the Contraception data set
fm1 <- glmer(use ~ urban+age+livch+(1|district), Contraception, binomial)
object_size(Contraception) ## 133 kB
object_size(fm1) ## 1.05 MB
object_size(fm2 <- glmer_chop(fm1)) ## 699 kB
get_obj_size(fm2) ## 'pp' is 547200 bytes
get_obj_size(fm2@pp,"RC") ## 'initialize' object is 547200
get_obj_size(environment(fm2@pp$initialize),"RC") ## inspect the environment behind 'initialize'
saveRDS(fm2,file="tmp.rds")
fm2 <- readRDS("tmp.rds")
object_size(fm2) ## 796 kB
rm(fm1)
pp <- predict(fm2,newdata=Contraception)
object_size(fm2) ## still 796K; no sharing
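Finally, note that compare_size(fm2) confirms that most of the information here is stored in environments rather than in the object itself (but I don't know how compare_size/object.size handle reference classes...)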
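A different route, not the glmer_chop approach above: if only probabilities are needed, it may be enough to keep the fixed-effect coefficients and the conditional modes of the random effects and rebuild the linear predictor by hand. The sketch below assumes a model with a single random intercept, uses the `cbpp` example, and does not handle grouping levels that were absent from the fit; the variable names are illustrative.

```r
library(lme4)

gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# The only pieces kept from the fit: both are small.
beta <- fixef(gm1)        # fixed-effect coefficients
b    <- ranef(gm1)$herd   # data frame of random intercepts, one row per herd

# Rebuild the linear predictor for a data set (here the original data).
X   <- model.matrix(~ period, data = cbpp)
eta <- as.vector(X %*% beta) + b[as.character(cbpp$herd), "(Intercept)"]

# Inverse logit link gives the predicted probabilities.
p_manual <- plogis(eta)

# Compare with predict() on the full model object.
all.equal(p_manual, unname(predict(gm1, type = "response")))
```

For models with random slopes or several grouping factors the same idea applies, but each random-effect term has to be added to the linear predictor separately, so this is more of a starting point than a general replacement for predict.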