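R predict glm fit on each column in data frame using column index number
I am trying to fit a BLR model to each column in a data frame and then predict on new data points. There are many columns, so they cannot be identified by name, only by column number. After looking through several similar examples on this site, I cannot work out why this does not work.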
df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y = rbinom(1000, size = 1, prob = 0.40))
for (i in 1:length(df)-1)
{
  fit <- glm (y ~ df[,i], data = df, family = binomial, na.action = na.exclude)
  new_pts <- data.frame(seq(min(df[,i], na.rm = TRUE), max(df[,i], na.rm = TRUE), len = 200))
  names(new_pts) <- names(df[, i])
  new_pred <- predict(fit, newdata = new_pts, type = "response")
}
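The predict() function raises a warning message and returns an array 1000 elements long, whereas the test data has only 200 elements.
Warning message:
'newdata' had 200 rows but variables found have 1000 rows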
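A likely explanation, offered as a sketch rather than as part of the original answers: the formula y ~ df[,i] stores the predictor under the literal name "df[, i]", so predict() finds no matching column in newdata and evaluates df[,i] on the original 1000 rows instead. In addition, 1:length(df)-1 is parsed as (1:length(df))-1, which starts at 0, and names(df[, i]) is NULL because df[, i] is a plain vector. The version below builds the formula from the column name with reformulate(); the names predictor_idx, preds and x_name, the set.seed() call, and the assumption that the response is the last column are illustrative choices, not part of the question.

```r
set.seed(1)  # assumption: added only so the sketch is reproducible
df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y  = rbinom(1000, size = 1, prob = 0.40))

# assumption: the response y is the last column, as in the question
predictor_idx <- seq_len(ncol(df) - 1)
preds <- vector("list", length(predictor_idx))
names(preds) <- names(df)[predictor_idx]

for (i in predictor_idx) {
  x_name <- names(df)[i]
  # build y ~ x1, y ~ x2, ... so the model records the real column name
  fit <- glm(reformulate(x_name, response = "y"), data = df,
             family = binomial, na.action = na.exclude)
  # the new data must carry a column with exactly that name
  new_pts <- setNames(
    data.frame(seq(min(df[[i]], na.rm = TRUE), max(df[[i]], na.rm = TRUE),
                   length.out = 200)),
    x_name)
  preds[[x_name]] <- predict(fit, newdata = new_pts, type = "response")
}

lengths(preds)  # number of predictions stored per predictor
```

Columns are still selected by index here; the name is only looked up from the index so that the fitted model and the new data agree on what the predictor is called.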
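For repeated modelling I use an approach like the one shown below. I have implemented it with data.table, but it could be rewritten to use a base data.frame (I guess the code would then be more verbose). In this approach I store all models in a separate object (below I provide two versions of the code: a more explanatory one, and a more advanced one aimed at a clean output).
Of course, you can also write a loop/function that fits only one model per iteration without storing them. In my opinion, saving the models is a good idea, since you will probably want to examine the models' robustness etc., not just predict new values.
HINT: Please also have a look at the answer of @AndS, which provides a tidyverse approach. Together with this answer, I think it makes a nice side-by-side comparison for learning/understanding the data.table and tidyverse approaches.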
# i have used some more simple data to show that the output is correct, see the plots
df <- data.frame(x1 = seq(1, 100, 10),
x2 = (1:10)^2,
y = seq(1, 20, 2))
library(data.table)
setDT(df)
# prepare the data by melting it
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
# also i used a more simple model (in this case lm would also do)
# create model for each variable (formerly columns)
models = setnames(DT[, data.table(list(glm(y ~ x))), by = "variable"], "V1", "model")
# create a new set of data to be predicted
# NOTE: this could, of course, also be added to the models data.table
# as new column via `:=list(...)`
new_pts = setnames(DT[, seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), len = 200), by = variable], "V1", "x")
# add the predicted values
new_pts[, predicted:= predict(models[variable == unlist(.BY), model][[1]], newdata = as.data.frame(x), type = "response")
, by = variable]
# plot and check if it makes sense
plot(df$x1, df$y)
lines(new_pts[variable == "x1", .(x, predicted)])
points(df$x2, df$y)
lines(new_pts[variable == "x2", .(x, predicted)])
# also the following version of above code is possible
# that generates only one new object in the environment
# but maybe looks more complicated at first sight
# not sure if this is the best way to do it
# data.table experts might provide some shortcuts
setDT(df)
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
DT = data.table(variable = unique(DT$variable), dat = split(DT, DT$variable))
DT[, models:= list(list(glm(y ~ x, data = dat[[1]]))), by = variable]
DT[, new_pts:= list(list(data.frame(x = dat[[1]][
,seq(min(x, na.rm = TRUE)
, max(x, na.rm = TRUE), len = 200)]
)))
, by = variable]
# in this version everything lives in DT, and the model column is called `models`
DT[, predicted:= list(list(data.frame(pred = predict(models[[1]]
                                                     , newdata = new_pts[[1]]
                                                     , type = "response")))),
   by = variable]
plot(df$x1, df$y)
lines(DT[variable == "x1", .(unlist(new_pts), unlist(predicted))])
points(df$x2, df$y)
lines(DT[variable == "x2", .(unlist(new_pts), unlist(predicted))])
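The answer notes that the same idea could be written with a base data.frame. A minimal sketch of what that might look like follows; it is not the answerer's code, and the names x_names, models_base, predict_grid and new_pts_base are hypothetical. It assumes every column other than y is a predictor.

```r
df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y  = seq(1, 20, 2))

# assumption: every column except y is a predictor
x_names <- setdiff(names(df), "y")

# fit one model per predictor and keep them all in a named list
models_base <- lapply(setNames(x_names, x_names), function(x_name) {
  glm(reformulate(x_name, response = "y"), data = df)
})

# predict on an n-point grid spanning the range of one predictor
predict_grid <- function(x_name, n = 200) {
  grid <- setNames(
    data.frame(seq(min(df[[x_name]], na.rm = TRUE),
                   max(df[[x_name]], na.rm = TRUE), length.out = n)),
    x_name)
  data.frame(variable  = x_name,
             x         = grid[[x_name]],
             predicted = as.numeric(predict(models_base[[x_name]],
                                            newdata = grid,
                                            type = "response")))
}

# long table of grid points and predictions, one block per predictor
new_pts_base <- do.call(rbind, lapply(x_names, predict_grid))
```

Keeping the fitted models in models_base mirrors the answer's advice to store them, so something like summary(models_base$x1) remains available after predicting.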
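The answer above does a good job. Here is another option for this kind of thing. First we reshape the data frame from wide to long, then we nest the data by group, then we run one model per group, and finally we map the predicted values out of the models and unnest our data frame. I plotted the predicted values to show that you get reasonable results. Note that until we unnest the data the models stay in the data frame, so we can extract any other information we need before unnesting.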
library(tidyverse)
df <- data.frame(x1 = seq(1, 100, 10),
x2 = (1:10)^2,
y = seq(1, 20, 2))
pred_df <- df %>%
  gather(var, val, -y) %>%
  nest(data = -var) %>%   # tidyr >= 1.0 syntax; nest(-var) is deprecated
  mutate(model = map(data, ~glm(y~val, data = .)),
         predicted = map(model, predict)) %>%
  unnest(c(data, predicted))   # tidyr >= 1.0 syntax
p1 <- pred_df %>%
  ggplot(aes(x = val, group = var))+
  geom_point(aes(y = y))+
  geom_line(aes(y = predicted))
p1
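The pipeline above predicts on the observed values, whereas the question asks for predictions on 200 new points per column. One way this might be adapted, as a sketch assuming tidyr 1.0 or later (for pivot_longer() and the nest()/unnest() syntax used here), is to build a grid per group and pass it as newdata. The names new_pred_df and new_pts are illustrative; the grid column must be called val so that it matches the formula.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y  = seq(1, 20, 2))

new_pred_df <- df %>%
  pivot_longer(-y, names_to = "var", values_to = "val") %>%
  nest(data = -var) %>%
  mutate(
    model = map(data, ~ glm(y ~ val, data = .x)),
    # 200-point grid over each predictor's range, named `val` like the formula
    new_pts = map(data, ~ tibble(val = seq(min(.x$val, na.rm = TRUE),
                                           max(.x$val, na.rm = TRUE),
                                           length.out = 200))),
    # pair each model with its own grid
    predicted = map2(model, new_pts,
                     ~ as.numeric(predict(.x, newdata = .y, type = "response")))
  ) %>%
  select(var, new_pts, predicted) %>%
  unnest(c(new_pts, predicted))
```

The result has one row per grid point with columns var, val and predicted, so it can be handed to ggplot in the same way as pred_df.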
Edit
Here we keep the models in the data frame and then pull out additional information.
df %>%
  gather(var, val, -y) %>%
  nest(data = -var) %>%
  mutate(model = map(data, ~glm(y~val, data = .)),
         predicted = map(model, predict))
# var data model predicted
# 1 x1 <tibble [10 × 2]> <S3: glm> <dbl [10]>
# 2 x2 <tibble [10 × 2]> <S3: glm> <dbl [10]>
Now we can extract the other information we are interested in.
df2 <- df %>%
  gather(var, val, -y) %>%
  nest(data = -var) %>%
  mutate(model = map(data, ~glm(y~val, data = .)),
         predicted = map(model, predict)) %>%
  mutate(intercept = map(model, ~summary(.x)$coefficients[[1]]),
         slope = map(model, ~summary(.x)$coefficients[[2]]))
df2
# var data model predicted intercept slope
# 1 x1 <tibble [10 × 2]> <S3: glm> <dbl [10]> <dbl [1]> <dbl [1]>
# 2 x2 <tibble [10 × 2]> <S3: glm> <dbl [10]> <dbl [1]> <dbl [1]>
Then we unnest only the columns whose values we want to see, leaving the rest of the information nested.
df2 %>% unnest(c(intercept, slope))
# var data model predicted intercept slope
# 1 x1 <tibble [10 × 2]> <S3: glm> <dbl [10]> 0.8 0.200
# 2 x2 <tibble [10 × 2]> <S3: glm> <dbl [10]> 3.35 0.173
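As a possible variation, not taken from the answer: coef() returns the estimates as a named vector, and purrr's map_dbl() returns a plain numeric column, so the intercept and slope can be stored as numbers directly and need no unnesting. The name coef_df is illustrative, and pivot_longer() again assumes tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y  = seq(1, 20, 2))

coef_df <- df %>%
  pivot_longer(-y, names_to = "var", values_to = "val") %>%
  nest(data = -var) %>%
  mutate(model = map(data, ~ glm(y ~ val, data = .x)),
         # look coefficients up by name rather than by position
         intercept = map_dbl(model, ~ coef(.x)[["(Intercept)"]]),
         slope     = map_dbl(model, ~ coef(.x)[["val"]])) %>%
  select(var, intercept, slope)
```

Looking coefficients up by name avoids relying on the element order of summary(model)$coefficients.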
Another option is to write a function that collects everything we want into a nested list; we can then pull out whichever elements we need.
get_my_info <- function(dat){
model <- glm(y~val, data = dat)
predicted <- predict(model)
intercept <- summary(model)$coefficients[[1]]
slope <- summary(model)$coefficients[[2]]
return(list(model = model,predicted = predicted, intercept = intercept, slope = slope))
}
df3 <- df %>%
  gather(var, val, -y) %>%
  nest(data = -var) %>%
  mutate(info = map(data, get_my_info))
df3
# var data info
# 1 x1 <tibble [10 × 2]> <list [4]>
# 2 x2 <tibble [10 × 2]> <list [4]>
If we want to pull out the predicted values:
df3 %>% mutate(pred = map(info, ~.x$predicted))
# var data info pred
# 1 x1 <tibble [10 × 2]> <list [4]> <dbl [10]>
# 2 x2 <tibble [10 × 2]> <list [4]> <dbl [10]>