使用包内的自定义方差函数从 gaulss-gams 进行预测时的环境问题

Question

未找到 R 包中可用的方差函数 predict 函数同时根据 gam-object 进行预测先前构建 (mgcv 1.8-31)。

R 包应（除其他事项外）从 gam 对象进行预测。全部以前构建的模型使用 gaulss 系列并有他们的自己的方差函数。一些方差函数只是线性效应一个变量，其他人使用更复杂的自定义函数。这模型和方差函数存储在 'sysdata.rda' 文件中以将它们包含在包中。该软件包记录在 devtools 和 roxygen2.

考虑以下两个 GAM 的最小示例：

library(mgcv)

set.seed(123)

var.fun <- function(x){x^2}

x <- runif(100)
y <- x + rnorm(100, 0, var.fun(x))

mod.gam.1 <- gam(formula = list(y ~ x,
                                ~ var.fun(x)),
                 family = gaulss(link = list("log", "logb")))

mod.gam.2 <- gam(formula = list(y ~ x,
                                ~ I(x^2)),
                 family = gaulss(link = list("log", "logb")))

第一个模型使用自定义方差函数。第二个模型有模型调用中硬编码的方差公式。

然后将模型和方差函数存储在 'sysdata.rda' 文件，它包含在名为 "gamvarfun" 的包中（我知道，命名事情...），如下：

save(mod.gam.1, var.fun, mod.gam.2,
     file = "~/gamvarfun/R/sysdata.rda")

现在包中添加了两个函数来从中检索预测对应机型：

pred.fun.1 <- function(x){
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}

pred.fun.2 <- function(x){
  predict(mod.gam.2,
          newdata = data.frame("x" = x))
}

为了说明问题，我上传了一个非常基本的演示包 github，所以下面的代码应该可以工作：

library(devtools)
install_github("jan-schick/gam_issue")
library(gamvarfun)

pred.fun.2(1)
# 0.3135275 0.5443409

pred.fun.1(1)
# Error in var.fun(x) : could not find function "var.fun"

# Writing var.fun to global environment
var.fun <- gamvarfun:::var.fun
pred.fun.1(1)
# 0.3135275 0.5443409

使用pred.fun.1（包含自定义方差函数）时，一个显示错误。但是，pred.fun.2（硬编码方差函数）工作得很好。不会发生此错误，当 devtools::load_all() 用于代替 'proper' 安装包裹。怀疑是环境不同导致的问题使用 predict.gam 时。我通过编写自定义来测试这个假设调用 pred.fun.1 之前对全局环境的方差函数（见上文），这是有效的。然而，这显然不是一个解决方案包。

在早期的尝试中，我试图在包内声明函数，例如通过将上面编写的代码直接放在预测中职能。安装包时也不起作用。

pred.fun.1 <- function(x){
  var.fun <- function(x){x^2}
  predict(mod.gam.1,
          newdata = data.frame("x" = x))
}

我也试过 with 和 attach 在同一个地方（在预测功能）没有任何成功。

我找到的唯一可行的解决方案是导出方差函数并将其作为包的一部分 namespace/API。这是不可行的在这种情况下的解决方案，因为它会导致许多可见的差异功能，对用户没有实际好处。

然后有一个明显的解决方法：替换方差函数通过模型调用中的原始公式，即使用 mod.gam.2 而不是 mod.gam.1。但是，这不是一个合适的解决方案要么。

请教了各种搜索引擎和同行无果。

因此，如能提供解决此问题的任何提示，我将不胜感激问题

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] gamvarfun_0.1.0 usethis_1.5.0   devtools_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        rstudioapi_0.10   magrittr_1.5      splines_3.6.2
 [5] pkgload_1.0.2     lattice_0.20-38   R6_2.4.0          rlang_0.4.0
 [9] tools_3.6.2       grid_3.6.2        pkgbuild_1.0.3    nlme_3.1-143
[13] mgcv_1.8-31       sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2
[17] remotes_2.0.4     assertthat_0.2.1  digest_0.6.20     rprojroot_1.3-2
[21] crayon_1.3.4      Matrix_1.2-18     processx_3.3.1    callr_3.2.0
[25] fs_1.3.1          ps_1.3.0          curl_4.0          testthat_2.1.1
[29] memoise_1.1.0     glue_1.3.1        compiler_3.6.2    desc_1.2.0
[33] backports_1.1.4   prettyunits_1.0.2

Answer 1

问题是公式有与之关联的环境，并且您正在保存与 globalenv() 的关联，但将函数放在其他地方。

如果您坚持使用 sysdata.rda，这并不容易修复，因为 mod.gam.1 对象将该环境复制到多个位置。打补丁 environment(mod.gam.1$formula).

还不够

我认为唯一可行的方法是将生成 mod.gam.1 对象的源包含在包源中。如果公式是在从包命名空间继承的上下文中定义的，那么将找到在那里定义的方差函数。

编辑添加：我试过这个策略，但没能成功。 mgcv 包中似乎存在错误，特别是在解释 gualss 模型使用的特殊对偶公式的代码中。我会看看是否可以找到解决方法。

第二次编辑： 运行下面的代码似乎修复了 mgcv 中存在错误的函数。它们基于 mgcv 版本 1.8-31。我不推荐运行这个脚本，除非你完全安装了那个版本。我已经向包维护者发送了一条消息；也许这些会被合并到未来的版本中。

interpret.gam0 <- function(gf,textra=NULL,extra.special=NULL)
# interprets a gam formula of the generic form:
#   y~x0+x1+x3*x4 + s(x5)+ s(x6,x7) ....
# and returns:
# 1. a model formula for the parametric part: pf (and pfok indicating whether it has terms)
# 2. a list of descriptors for the smooths: smooth.spec
# this is function does the work, and is called by in interpret.gam
# 'textra' is optional text to add to term labels
# 'extra.special' is label of extra smooth within formula.
{ p.env <- environment(gf) # environment of formula
  tf <- terms.formula(gf,specials=c("s","te","ti","t2",extra.special)) # specials attribute indicates which terms are smooth

  terms <- attr(tf,"term.labels") # labels of the model terms 
  nt <- length(terms) # how many terms?

  if (attr(tf,"response") > 0) {  # start the replacement formulae
    response <- as.character(attr(tf,"variables")[2])
  } else { 
    response <- NULL
  }
  sp <- attr(tf,"specials")$s     # array of indices of smooth terms 
  tp <- attr(tf,"specials")$te    # indices of tensor product terms
  tip <- attr(tf,"specials")$ti   # indices of tensor product pure interaction terms
  t2p <- attr(tf,"specials")$t2   # indices of type 2 tensor product terms
  zp <- if (is.null(extra.special)) NULL else attr(tf,"specials")[[extra.special]]
  off <- attr(tf,"offset") # location of offset in formula

  ## have to translate sp, tp, tip, t2p (zp) so that they relate to terms,
  ## rather than elements of the formula...
  vtab <- attr(tf,"factors") # cross tabulation of vars to terms
  if (length(sp)>0) for (i in 1:length(sp)) {
    ind <- (1:nt)[as.logical(vtab[sp[i],])]
    sp[i] <- ind # the term that smooth relates to
  }
  if (length(tp)>0) for (i in 1:length(tp)) {
    ind <- (1:nt)[as.logical(vtab[tp[i],])]
    tp[i] <- ind # the term that smooth relates to
  } 
  if (length(tip)>0) for (i in 1:length(tip)) {
    ind <- (1:nt)[as.logical(vtab[tip[i],])]
    tip[i] <- ind # the term that smooth relates to
  } 
  if (length(t2p)>0) for (i in 1:length(t2p)) {
    ind <- (1:nt)[as.logical(vtab[t2p[i],])]
    t2p[i] <- ind # the term that smooth relates to
  }
  if (length(zp)>0) for (i in 1:length(zp)) {
    ind <- (1:nt)[as.logical(vtab[zp[i],])]
    zp[i] <- ind # the term that smooth relates to
  } ## re-referencing is complete

  k <- kt <- kti <- kt2 <- ks <- kz <- kp <- 1 # counters for terms in the 2 formulae
  len.sp <- length(sp)
  len.tp <- length(tp)
  len.tip <- length(tip)
  len.t2p <- length(t2p)
  len.zp <- length(zp)
  ns <- len.sp + len.tp + len.tip + len.t2p + len.zp# number of smooths
  pav <- av <- rep("",0)
  smooth.spec <- list()
  #mgcvat <- "package:mgcv" %in% search() ## is mgcv in search path?
  mgcvns <- loadNamespace('mgcv')
  if (nt) for (i in 1:nt) { # work through all terms
    if (k <= ns&&((ks<=len.sp&&sp[ks]==i)||(kt<=len.tp&&tp[kt]==i)||(kz<=len.zp&&zp[kz]==i)||
                  (kti<=len.tip&&tip[kti]==i)||(kt2<=len.t2p&&t2p[kt2]==i))) { # it's a smooth
      ## have to evaluate in the environment of the formula or you can't find variables 
      ## supplied as smooth arguments, e.g. k <- 5;gam(y~s(x,k=k)), fails,
      ## but if you don't specify namespace of mgcv then stuff like 
      ## loadNamespace('mgcv'); k <- 10; mgcv::interpret.gam(y~s(x,k=k)) fails (can't find s)
      ## eval(parse(text=terms[i]),envir=p.env,enclos=loadNamespace('mgcv')) fails??
      ## following may supply namespace of mgcv explicitly if not on search path...
      ## If 's' etc are masked then we can fail even if mgcv on search path, hence paste
      ## of explicit mgcv reference into first attempt...

      st <- try(eval(parse(text=paste("mgcv::",terms[i],sep="")),envir=p.env),silent=TRUE)
      if (inherits(st,"try-error")) st <- 
            eval(parse(text=terms[i]),enclos=p.env,envir=mgcvns)

      if (!is.null(textra)) { ## modify the labels on smooths with textra
        pos <- regexpr("(",st$lab,fixed=TRUE)[1]
        st$label <- paste(substr(st$label,start=1,stop=pos-1),textra,
                    substr(st$label,start=pos,stop=nchar(st$label)),sep="")
      }
      smooth.spec[[k]] <- st
      if (ks<=len.sp&&sp[ks]==i) ks <- ks + 1 else # counts s() terms
      if (kt<=len.tp&&tp[kt]==i) kt <- kt + 1 else # counts te() terms
      if (kti<=len.tip&&tip[kti]==i) kti <- kti + 1 else # counts ti() terms
      if (kt2<=len.t2p&&t2p[kt2]==i) kt2 <- kt2 + 1 # counts t2() terms
      else kz <- kz + 1
      k <- k + 1      # counts smooth terms 
    } else {          # parametric
      av[kp] <- terms[i] ## element kp on rhs of parametric
      kp <- kp+1    # counts parametric terms
    }
  }    
  if (!is.null(off)) { ## deal with offset 
    av[kp] <- as.character(attr(tf,"variables")[1+off])
    kp <- kp+1          
  }

  pf <- paste(response,"~",paste(av,collapse=" + "))
  if (attr(tf,"intercept")==0) {
    pf <- paste(pf,"-1",sep="")
    if (kp>1) pfok <- 1 else pfok <- 0
  } else { 
    pfok <- 1;if (kp==1) { 
      pf <- paste(pf,"1"); 
    }
  }

  fake.formula <- pf

  if (length(smooth.spec)>0) 
  for (i in 1:length(smooth.spec)) {
    nt <- length(smooth.spec[[i]]$term)
    ff1 <- paste(smooth.spec[[i]]$term[1:nt],collapse="+")
    fake.formula <- paste(fake.formula,"+",ff1)
    if (smooth.spec[[i]]$by!="NA") {
      fake.formula <- paste(fake.formula,"+",smooth.spec[[i]]$by)
      av <- c(av,smooth.spec[[i]]$term,smooth.spec[[i]]$by)
    } else av <- c(av,smooth.spec[[i]]$term)
  }
  fake.formula <- as.formula(fake.formula,p.env)
  if (length(av)) {
    pred.formula <- as.formula(paste("~",paste(av,collapse="+")))
    pav <- all.vars(pred.formula) ## trick to strip out 'offset(x)' etc...
    pred.formula <- reformulate(pav) 
  environment(pred.formula) <- environment(gf)
  } else  pred.formula <- ~1
  ret <- list(pf=as.formula(pf,p.env),pfok=pfok,smooth.spec=smooth.spec,
            fake.formula=fake.formula,response=response,fake.names=av,
            pred.names=pav,pred.formula=pred.formula)
  class(ret) <- "split.gam.formula"
  ret
} ## interpret.gam0

interpret.gam <- function(gf,extra.special=NULL) {
## wrapper to allow gf to be a list of formulae or 
## a single formula. This facilitates general penalized 
## likelihood models in which several linear predictors 
## may be involved...
##
## The list syntax is as follows. The first formula must have a response on
## the lhs, rather than labels. For m linear predictors, there 
## must be m 'base formulae' in linear predictor order. lhs labels will 
## be ignored in a base formula. Empty base formulae have '-1' on rhs.
## Further formulae have labels up to m labels 1,...,m on the lhs, in a 
## syntax like this: 3 + 5 ~ s(x), which indicates that the same s(x) 
## should be added to both linear predictors 3 and 5. 
## e.g. A bivariate normal model with common expected values might be
## list(y1~-1,y2~-1,1+2~s(x)), whereas if the second component was contaminated 
## by something else we might have list(y1~-1,y2~s(v)-1,1+2~s(x)) 
## 
## For a list argument, this routine returns a list of split.formula objects 
## with an extra field "lpi" indicating the linear predictors to which each 
## contributes...
  if (is.list(gf)) {
    d <- length(gf)

    ## make sure all formulae have a response, to avoid
    ## problems with parametric sub formulae of the form ~1
    #if (length(gf[[1]])<3) stop("first formula must specify a response")
    resp <- gf[[1]][2]

    ret <- list()
    pav <- av <- rep("",0)
    nlp <- 0 ## count number of linear predictors (may be different from number of formulae)
    for (i in 1:d) {
      textra <- if (i==1) NULL else paste(".",i-1,sep="") ## modify smooth labels to identify to predictor  

      lpi <- getNumericResponse(gf[[i]]) ## get linear predictors to which this applies, if explicit
      if (length(lpi)==1) warning("single linear predictor indices are ignored")
      if (length(lpi)>0) gf[[i]][[2]] <- NULL else { ## delete l.p. labels from formula response 
        nlp <- nlp + 1;lpi <- nlp ## this is base formula for l.p. number nlp       
      }
      ret[[i]] <- interpret.gam0(gf[[i]],textra,extra.special=extra.special)
      ret[[i]]$lpi <- lpi ## record of the linear predictors to which this applies

      ## make sure all parametric formulae have a response, to avoid
      ## problems with parametric sub formulae of the form ~1
      respi <- rep("",0) ## no extra response terms
      if (length(ret[[i]]$pf)==2) { 
        ret[[i]]$pf[3] <- ret[[i]]$pf[2];ret[[i]]$pf[2] <- resp
        respi <- rep("",0)
      } else if (i>1) respi <- ret[[i]]$response ## extra response terms
      av <- c(av,ret[[i]]$fake.names,respi) ## accumulate all required variable names 
      pav <- c(pav,ret[[i]]$pred.names) ## predictors only 
    } 
    av <- unique(av) ## strip out duplicate variable names
    pav <- unique(pav)
    if (length(av)>0) {
      ## work around - reformulate with response = "log(x)" will treat log(x) as a name,
      ## not the call it should be... 
      fff <- formula(paste(ret[[1]]$response,"~ ."))
      ret$fake.formula <- reformulate(av,response=ret[[1]]$response) 
      environment(ret$fake.formula) <- environment(gf[[1]]) 
      ret$fake.formula[[2]] <- fff[[2]] ## fix messed up response
    } else ret$fake.formula <- ret[[1]]$fake.formula ## create fake formula containing all variables
    ret$pred.formula <- if (length(pav)>0) reformulate(pav) else ~1 ## predictor only formula
    environment(ret$pred.formula) <- environment(gf[[1]])
    ret$response <- ret[[1]]$response 
    ret$nlp <- nlp ## number of linear predictors
    for (i in 1:d) if (max(ret[[i]]$lpi)>nlp||min(ret[[i]]$lpi)<1) stop("linear predictor labels out of range")
    class(ret) <- "split.gam.formula"
    return(ret)
  } else interpret.gam0(gf,extra.special=extra.special)  
} ## interpret.gam


## Now some test code.

environment(interpret.gam) <- environment(mgcv::interpret.gam)
environment(interpret.gam0) <- environment(mgcv:::interpret.gam0)
assignInNamespace("interpret.gam", interpret.gam, "mgcv")
assignInNamespace("interpret.gam0", interpret.gam0, "mgcv")

set.seed(123)

mod.gam.1 <- local({
    var.fun <- function(x){x^2}

    x <- runif(100)
    y <- x + rnorm(100, 0, var.fun(x))

    gam(formula = list(y ~ x,
                                         ~ var.fun(x)),
            family = gaulss(link = list("log", "logb")))
})

pred.fun.1 <- function(x){
    predict(mod.gam.1,
                  newdata = data.frame("x" = x))
}

pred.fun.1(1)

使用包内的自定义方差函数从 gaulss-gams 进行预测时的环境问题

Environmental problems while predicting from gaulss-gams with a custom variance function inside a package

r

package

prediction

gam

mgcv