R predict glm fit on each column in data frame using column index number

Question

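Trying to fit a BLR model to each column in a data frame and then predict on new data points. There are a lot of columns, so the columns cannot be identified by name, only by column number. Having reviewed several examples of a similar nature on this site, I cannot figure out why this does not work.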

df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y = rbinom(1000, size = 1, prob = 0.40))

for (i in 1:length(df)-1)
{
        fit <- glm (y ~ df[,i], data = df, family = binomial, na.action = na.exclude)

        new_pts <- data.frame(seq(min(df[,i], na.rm = TRUE), max(df[,i], na.rm = TRUE), len = 200))
        names(new_pts) <- names(df[, i])

        new_pred <- predict(fit, newdata = new_pts, type = "response")

}

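The predict() function raises a warning message and returns an array 1000 elements long, while the test data has only 200 elements.

Warning message: 'newdata' had 200 rows but variables found have 1000 rows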

Answer 1

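For repetitive modelling I use an approach similar to the one shown below. I have implemented it with data.table, but it could be rewritten to use a base data.frame (the code would be more verbose then, I guess). In this approach I store all models in a separate object. Below I provide two versions of the code: a more explanatory one, and a more advanced one that aims at clean output.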

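Of course, you could also write a loop/function that fits only one model per iteration without storing them. In my opinion, saving the models is a good idea, because you will probably need to examine the robustness of the models and so on, not only predict new values.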
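As a rough illustration of the loop variant, the base-R sketch below fits one model per predictor column selected by position and predicts on a 200-point grid. It assumes the response is the last column and is named y; pred_idx, var and preds are names made up for this example. The formula is built with reformulate() so that the model term is a real column name: a term such as df[, i] cannot be matched in newdata, so predict() falls back to the 1000 original rows. The column name is taken from names(df)[i], because names(df[, i]) is NULL for a single column.

```r
set.seed(1)
df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y  = rbinom(1000, size = 1, prob = 0.40))

# Predictor columns by position: everything except the last column (y).
# Note that 1:length(df)-1 evaluates to 0:3, hence seq_len() here.
pred_idx <- seq_len(ncol(df) - 1)

preds <- vector("list", length(pred_idx))
for (i in pred_idx) {
  var <- names(df)[i]

  # Builds y ~ x1, y ~ x2, ... from the column name.
  fit <- glm(reformulate(var, response = "y"), data = df,
             family = binomial, na.action = na.exclude)

  # newdata needs a column with the same name as the model term.
  new_pts <- data.frame(seq(min(df[[i]], na.rm = TRUE),
                            max(df[[i]], na.rm = TRUE),
                            length.out = 200))
  names(new_pts) <- var

  preds[[i]] <- predict(fit, newdata = new_pts, type = "response")
}
names(preds) <- names(df)[pred_idx]

# Each element should now hold one prediction per grid point.
lengths(preds)
```

To keep the fitted models as well, fit could be stored in a second list inside the same loop.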

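HINT: Please also have a look at the answer of @AndS., which provides a tidyverse approach. Together with this answer, I think it makes a nice side-by-side comparison for learning/understanding the data.table and tidyverse approaches.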

# I used simpler data to show that the output is correct; see the plots
df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y =  seq(1, 20, 2))
library(data.table)
setDT(df)
# prepare the data by melting it
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
# I also used a simpler model (in this case lm would also do)
# create model for each variable (formerly columns)
models = setnames(DT[, data.table(list(glm(y ~ x))), by = "variable"], "V1", "model")
# create a new set of data to be predicted
# NOTE: this could, of course, also be added to the models data.table
# as new column via `:=list(...)`
new_pts = setnames(DT[, seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), len = 200), by = variable], "V1", "x")
# add the predicted values
new_pts[, predicted:= predict(models[variable == unlist(.BY), model][[1]], newdata = as.data.frame(x),  type = "response")
        , by = variable]
# plot and check if it makes sense
plot(df$x1, df$y)
lines(new_pts[variable == "x1", .(x, predicted)])
points(df$x2, df$y)
lines(new_pts[variable == "x2", .(x, predicted)])

# The following version of the code above is also possible.
# It generates only one new object in the environment,
# but may look more complicated at first sight.
# Not sure if this is the best way to do it;
# data.table experts might provide some shortcuts.
setDT(df)
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
DT = data.table(variable = unique(DT$variable), dat = split(DT, DT$variable))
DT[, models:= list(list(glm(y ~ x, data = dat[[1]]))), by = variable]
DT[, new_pts:= list(list(data.frame(x = dat[[1]][
                                                 ,seq(min(x, na.rm = TRUE)
                                                 , max(x, na.rm = TRUE), len = 200)]
                                    )))
       , by = variable]
# the models and new_pts columns live in DT, so predict within DT
DT[, predicted:= list(list(data.frame(pred = predict(models[[1]]
                                          , newdata = new_pts[[1]]
                                          ,  type = "response")))),
       by = variable]
plot(df$x1, df$y)
lines(DT[variable == "x1", .(unlist(new_pts), unlist(predicted))])
points(df$x2, df$y)
lines(DT[variable == "x2", .(unlist(new_pts), unlist(predicted))])
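The same melt-then-model idea might be applied to the binomial data from the question, selecting the predictor columns by position rather than by name. This is only a sketch under assumptions: the predictors are taken to be columns 1 to 3, the response is named y, and res and grid are names invented here. Unlike the versions above, it does not store the fitted models.

```r
library(data.table)

set.seed(42)
df <- data.table(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y  = rbinom(1000, size = 1, prob = 0.40))

# measure.vars accepts column positions, so no column names are needed here.
DT <- melt(df, id.vars = "y", measure.vars = 1:3, value.name = "x")

# Fit one logistic model per former column and predict on a 200-point grid.
res <- DT[, {
  fit  <- glm(y ~ x, family = binomial)
  grid <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = 200)
  .(x = grid,
    predicted = predict(fit, newdata = data.frame(x = grid), type = "response"))
}, by = variable]

# Number of predicted rows per former column.
res[, .N, by = variable]
```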

Answer 2

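The other answer does a great job. Here is another option for this type of thing. First we take the data frame from wide to long, then we nest the data by group, then we run one model per group, and finally we map out the predicted values from the models and unnest our data frame. I plotted the predicted values to show that you get reasonable results. Note that until we unnest the data, the models are kept in the data frame, so we can extract any other information we need before unnesting.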

library(tidyverse)

df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y =  seq(1, 20, 2))

pred_df <- df %>% 
  gather(var, val, -y) %>% 
  nest(-var) %>% 
  mutate(model = map(data, ~glm(y~val, data = .)), 
         predicted = map(model, predict)) %>% 
  unnest(data, predicted)

p1 <- pred_df %>% 
  ggplot(aes(x = val, group = var))+
  geom_point(aes(y = y))+
  geom_line(aes(y = predicted))
p1
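The pipeline above predicts at the observed values only. To predict on a grid of new points, as the question asks, a variation along the following lines might work. It is a sketch, not part of the original answer: it assumes tidyr 1.0 or later (pivot_longer(), nest(data = ...), unnest(c(...))), and grid and grid_df are names chosen for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y  = seq(1, 20, 2))

grid_df <- df %>%
  pivot_longer(-y, names_to = "var", values_to = "val") %>%
  nest(data = -var) %>%
  mutate(model = map(data, ~ glm(y ~ val, data = .x)),
         # 200 evenly spaced new points per variable; the column must be
         # named val to match the model term
         grid = map(data, ~ data.frame(val = seq(min(.x$val), max(.x$val),
                                                 length.out = 200))),
         predicted = map2(model, grid,
                          ~ predict(.x, newdata = .y, type = "response"))) %>%
  select(var, grid, predicted) %>%
  unnest(c(grid, predicted))
```

The result has one row per grid point and variable, with the columns var, val and predicted.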

Edit

Here we keep the models in the data frame and then extract additional information.

df %>% 
    gather(var, val, -y) %>% 
    nest(-var) %>% 
    mutate(model = map(data, ~glm(y~val, data = .)), 
           predicted = map(model, predict))
#   var   data              model     predicted 
# 1 x1    <tibble [10 × 2]> <S3: glm> <dbl [10]>
# 2 x2    <tibble [10 × 2]> <S3: glm> <dbl [10]>

Now we can extract the other information we are interested in.

df2 <- df %>% 
    gather(var, val, -y) %>% 
    nest(-var) %>% 
    mutate(model = map(data, ~glm(y~val, data = .)), 
           predicted = map(model, predict)) %>%
    mutate(intercept = map(model, ~summary(.x)$coefficients[[1]]),
           slope = map(model, ~summary(.x)$coefficients[[2]]))
df2
#   var   data              model     predicted  intercept slope    
# 1 x1    <tibble [10 × 2]> <S3: glm> <dbl [10]> <dbl [1]> <dbl [1]>
# 2 x2    <tibble [10 × 2]> <S3: glm> <dbl [10]> <dbl [1]> <dbl [1]>

Then we just unnest to extract the values, while keeping the rest of the information nested.

df2 %>% unnest(intercept, slope)
#   var   data              model     predicted  intercept slope
# 1 x1    <tibble [10 × 2]> <S3: glm> <dbl [10]>      0.8  0.200
# 2 x2    <tibble [10 × 2]> <S3: glm> <dbl [10]>      3.35 0.173
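For scalar values such as these, a possible shortcut is purrr's map_dbl(), which returns a plain numeric column so that no unnest step is needed. The sketch below is an assumption-laden variant rather than the answer's own code: it uses tidyr 1.0 or later syntax, reads the coefficients by name with coef(), and coef_df is a name chosen here.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y  = seq(1, 20, 2))

coef_df <- df %>%
  pivot_longer(-y, names_to = "var", values_to = "val") %>%
  nest(data = -var) %>%
  mutate(model = map(data, ~ glm(y ~ val, data = .x)),
         # map_dbl() yields ordinary double columns instead of list columns
         intercept = map_dbl(model, ~ coef(.x)[["(Intercept)"]]),
         slope     = map_dbl(model, ~ coef(.x)[["val"]]))
```

Reading the coefficients by name avoids relying on their position in the summary matrix.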

Another option is to write a function that maps all the data we want into a nested list; we can then pull out whichever elements we need.

get_my_info <- function(dat){
    model <- glm(y~val, data = dat)
    predicted <- predict(model)
    intercept <- summary(model)$coefficients[[1]]
    slope <- summary(model)$coefficients[[2]]
    return(list(model = model,predicted = predicted, intercept = intercept, slope = slope))
}

df3 <- df %>% 
    gather(var, val, -y) %>% 
    nest(-var) %>% 
    mutate(info = map(data, get_my_info))
df3
#   var   data              info      
# 1 x1    <tibble [10 × 2]> <list [4]>
# 2 x2    <tibble [10 × 2]> <list [4]>

If we want to extract the predicted values:

df3 %>% mutate(pred = map(info, ~.x$predicted))
#   var   data              info       pred      
# 1 x1    <tibble [10 × 2]> <list [4]> <dbl [10]>
# 2 x2    <tibble [10 × 2]> <list [4]> <dbl [10]>
