How to reliably use a matrix (multivariable) response in R using a formula?

Question

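I am trying to use a multivariable response in R and keep running into the damn formulas, with all sorts of unexpected behavior, mostly when using them inside functions and packages. The question is twofold: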

How can I input a multivariable response and still be able to use the model to predict afterwards?

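Take this MVE, which uses the rpart package, as an example. Here y is a two-column matrix (the response), and I want to predict y using x, that is, using every column of x (two columns in this MVE). Note that the method itself has nothing to do with what each column of y means; this is just an MVE to reproduce the problem: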

library(rpart)

set.seed(1)
y <- rpois(10, lambda = 1.25)
y <- cbind(c(1,4,10,11,12, 14,16,17,20, 21), y)
print(y)
x <- matrix(1:20, ncol = 2) # just two dummy predictors
print(x)

mymodel <- rpart(y ~ x, method = "poisson", minbucket = 1)

newx <- matrix(11:20, ncol = 2) # just some dummy test predictors, note that we have fewer rows
predict(mymodel, newdata = data.frame(newx))
# output:
#          1          2          3          4          5          6          7          8          9         10
# 0.12500000 0.12500000 0.12500000 0.20000000 0.04761905 0.17948718 0.17948718 0.11538462 0.04000000 0.04000000
# Warning message:
# 'newdata' had 5 rows but variables found have 10 rows

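As you can see, I am unable to predict on a new dataset. I have been messing around with column names and row names, but have not been able to get it to work.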

Furthermore,

How can I make a wrapper that is "safe"?

For instance, in this MVE:

mywrapper <- function(y, x){
  mymodel <- rpart(y ~ x, method = "poisson", minbucket = 1)
  
  return(mymodel)
}

and given the help provided in the R documentation:

A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument. Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.

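I do not quite understand what this means. As far as I understand, not passing y or x to mywrapper() will result in an error (which is the expected behavior). I ask because I am working on an R package and on functions inside packages, and I want to make sure that formulas do not behave unexpectedly.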

Answer 1

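I have not used the predict method for rpart, but according to the documentation, this function needs a new dataset that has the same variables as the original dataset.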

So you should create newx with the correct column names:

newx = matrix(11:20, ncol = 2,dimnames=list(NULL,c("x1","x2")))

Now the columns are labelled x1 and x2, just like the variables in your model, and predict knows what to do with them.

Answer 2

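Generally speaking, formulas in R are designed to work with data frames. rpart works with matrices, and although a data frame can contain a matrix, the matrix tends to be converted into separate columns. To avoid that, wrap the matrix in I():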

# Same as your code to start...then this:

predict(mymodel, newdata = data.frame(x = I(newx)))
#>    1    2    3    4    5 
#> 0.04 0.04 0.04 0.04 0.04
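The column splitting can be seen without rpart at all. The following is a small base R sketch; newx follows the question, and df is just an illustrative name.

```r
newx <- matrix(11:20, ncol = 2)

# Passing the matrix directly: each column becomes its own variable.
names(data.frame(newx))          # "X1" "X2"

# Naming the argument is not enough; the columns are still split.
names(data.frame(x = newx))      # "x.1" "x.2"

# Wrapping in I() keeps a single matrix-valued column called x,
# which is the name the formula y ~ x looks for.
df <- data.frame(x = I(newx))
names(df)                        # "x"
dim(df$x)                        # 5 2
```

Only the last form gives predict a variable named x with 5 rows, so it no longer falls back to the 10-row x in the calling environment.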

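For the second part of your question: you create the formula inside the mywrapper function, so if a variable is not contained in the newdata data frame, that is where R will look for it. An "environment" in R is similar to a "stack frame" in other languages; the main difference is that an environment has exactly one parent, and if an object is not found in the original environment, the search continues there.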

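In general the parent is not the caller's frame. For a function's evaluation frame it is the environment in which the function was defined; otherwise it is whatever was explicitly specified as the parent.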
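This can be checked directly. A minimal sketch, in which f, g, here and res are made-up names used only for illustration:

```r
here <- environment()   # the environment in which this code is evaluated

# f reports the parent of its own evaluation frame.
f <- function() {
  parent.env(environment())
}

# g calls f, so g's frame is the *caller* of f.
g <- function() {
  g_frame <- environment()
  list(parent_of_f = f(), g_frame = g_frame)
}

res <- g()
identical(res$parent_of_f, here)         # TRUE: where f was defined
identical(res$parent_of_f, res$g_frame)  # FALSE: not the caller's frame
```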

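So what happens if you run predict on the return value of mywrapper? It looks at the formula to find the variables it needs. Prediction only needs the variables on the right-hand side, so just x. If you supply x in the newdata argument of predict, everything works as before, but if you do not, things are different.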
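One way to make sure x is always supplied in newdata is to pair the fitting wrapper with a small prediction helper. This is only a sketch under the assumptions of the question (a two-column Poisson response and a matrix of predictors); fit_tree and predict_tree are hypothetical names, not part of rpart.

```r
library(rpart)

# Hypothetical fitting wrapper, same idea as mywrapper in the question.
fit_tree <- function(y, x) {
  rpart(y ~ x, method = "poisson", minbucket = 1)
}

# Hypothetical prediction helper: it always builds newdata with a single
# matrix column named x, so predict never falls back to the environment.
predict_tree <- function(model, newx) {
  stopifnot(is.matrix(newx))
  predict(model, newdata = data.frame(x = I(newx)))
}

set.seed(1)
y <- cbind(c(1, 4, 10, 11, 12, 14, 16, 17, 20, 21), rpois(10, lambda = 1.25))
x <- matrix(1:20, ncol = 2)

model <- fit_tree(y, x)
pred <- predict_tree(model, matrix(11:20, ncol = 2))
length(pred)   # one prediction per row of the new matrix
```

Keeping the construction of newdata in one place means callers never have to remember the I() wrapping or the variable name used in the formula.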

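When x is not found in newdata, the lookup goes to the formula's environment. That is the evaluation frame of mywrapper, and x will be found there, because it is an argument of that function.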
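The lookup order can be reproduced with base R alone. In this sketch make_formula is a hypothetical stand-in for mywrapper that only builds and returns the formula; all other names are illustrative as well.

```r
# Hypothetical stand-in for mywrapper: it only creates the formula.
make_formula <- function(y, x) {
  y ~ x
}

x_in <- matrix(1:6, ncol = 2)
y_in <- c(2, 0, 1)
fml <- make_formula(y_in, x_in)

# The formula's environment is the evaluation frame of make_formula,
# so it contains the arguments x and y.
fenv <- environment(fml)
ls(fenv)                             # "x" "y"

# With no data argument, model.frame resolves both variables there.
mf <- model.frame(fml)
dim(mf$x)                            # 3 2

# Prediction only needs the right-hand side. A variable found in the
# supplied data frame takes precedence over the formula's environment.
rhs <- delete.response(terms(fml))
new_df <- data.frame(x = I(matrix(7:10, ncol = 2)))
nrow(model.frame(rhs, new_df))       # 2
```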

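If it were looking for z, it would not find it there. The next place to look is the parent environment, which is the environment that was in effect when mywrapper was created, i.e. the global environment. If z is not there either, it searches the chain of environments listed by search(), which are typically package exports. The search() list is linked together so that each entry is the parent of the previous one.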
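The link between search() and the parent chain can be verified by walking the chain by hand. A minimal sketch; env and chain are illustrative names:

```r
# Walk from the global environment up through its parents.
env <- globalenv()
chain <- character()
while (!identical(env, emptyenv())) {
  chain <- c(chain, environmentName(env))
  env <- parent.env(env)
}

# One environment per search() entry, in the same order.
length(chain) == length(search())   # TRUE
chain[1]                            # "R_GlobalEnv"
chain[length(chain)]                # "base"
```

The names differ slightly from those printed by search() (for example "R_GlobalEnv" versus ".GlobalEnv"), but the environments are the same ones in the same order.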

I hope this is not too much information....

