使用 purrr 运行 具有变化结果的多元回归模型,然后提取残差
Use purrr to run multiple regression models with changing outcomes and then extract residuals
我想 运行 一些具有不同 y
的回归模型(因此自变量对所有模型保持相同),然后从每个模型中提取残差并添加他们到原始数据集。
我会用diamonds
来展示我的想法:
# In my example, the models are: x or y or z = carat + cut + color + clarity + price
dependent = c("x", "y", "z")
model = function(y, dataset) {
a = map(
setNames(y, y), ~ glm(reformulate(termlabels = c("carat", "cut", "color", "clarity", "price"),
response = y),
family = gaussian(link = "identity"),
data = dataset
))
resids = map_dfr(a, residuals)
df = bind_cols(dataset, resids)
print(df)
}
model(y = dependent, dataset = diamonds)
但是这段代码不起作用。我也想为作为新列添加的残差起一个合理的名称,否则当模型数量很大时很难区分残差。
生成示例数据
library(tidyverse)
set.seed(101)
dd <- diamonds
dependent <- c("x", "y", "z")
for (d in dependent) {
dd[[d]] <- rnorm(nrow(diamonds))
}
进程
library(broom)
res <- (dependent
## set names so .id = argument works downstream
%>% setNames(dependent)
## construct list of formulas
%>% map(reformulate, termlabels = c("carat", "cut", "color", "clarity", "price"))
## fit glmes
%>% map(glm, family = gaussian(link = "identity"), dd,
na.action = na.exclude)
## compute resids (add observation number) and collapse to tibble
%>% map_dfr(~tibble(.obs=seq(nrow(dd)), .resid = residuals(.)), .id = "response")
## widen data → residuals from each response variable as a column
%>% pivot_wider(names_from = "response", values_from = ".resid",
names_prefix ="res_")
%>% select(-.obs)
)
## combine with original data
res2 <- bind_cols(dd, res)
备注:
- 我不明白你为什么要在这里使用
glm(., family = gaussian(link = "identity))
,除非它是你用真实数据做的更复杂的事情的占位符。 (如果这是您的实际模型,那么使用 lm()
会更简单、更快。)
- 将
na.action = na.exclude
添加到 [g]lm()
调用将在预测、残差等中包含 NA
值,这将帮助您的残差与原始数据更好地对齐。
我想 运行 一些具有不同 y
的回归模型(因此自变量对所有模型保持相同),然后从每个模型中提取残差并添加他们到原始数据集。
我会用diamonds
来展示我的想法:
# In my example, the models are: x or y or z = carat + cut + color + clarity + price
dependent = c("x", "y", "z")
model = function(y, dataset) {
a = map(
setNames(y, y), ~ glm(reformulate(termlabels = c("carat", "cut", "color", "clarity", "price"),
response = y),
family = gaussian(link = "identity"),
data = dataset
))
resids = map_dfr(a, residuals)
df = bind_cols(dataset, resids)
print(df)
}
model(y = dependent, dataset = diamonds)
但是这段代码不起作用。我也想为作为新列添加的残差起一个合理的名称,否则当模型数量很大时很难区分残差。
生成示例数据
library(tidyverse)
set.seed(101)
dd <- diamonds
dependent <- c("x", "y", "z")
for (d in dependent) {
dd[[d]] <- rnorm(nrow(diamonds))
}
进程
library(broom)
res <- (dependent
## set names so .id = argument works downstream
%>% setNames(dependent)
## construct list of formulas
%>% map(reformulate, termlabels = c("carat", "cut", "color", "clarity", "price"))
## fit glmes
%>% map(glm, family = gaussian(link = "identity"), dd,
na.action = na.exclude)
## compute resids (add observation number) and collapse to tibble
%>% map_dfr(~tibble(.obs=seq(nrow(dd)), .resid = residuals(.)), .id = "response")
## widen data → residuals from each response variable as a column
%>% pivot_wider(names_from = "response", values_from = ".resid",
names_prefix ="res_")
%>% select(-.obs)
)
## combine with original data
res2 <- bind_cols(dd, res)
备注:
- 我不明白你为什么要在这里使用
glm(., family = gaussian(link = "identity))
,除非它是你用真实数据做的更复杂的事情的占位符。 (如果这是您的实际模型,那么使用lm()
会更简单、更快。) - 将
na.action = na.exclude
添加到[g]lm()
调用将在预测、残差等中包含NA
值,这将帮助您的残差与原始数据更好地对齐。