使用空间数据从 ddply 函数内部的多元线性回归 (lm) 中提取 p 值
extracting p values from multiple linear regression (lm) inside of a ddply function using spatial data
我有一组空间坐标 (x,y) 数据,其中每个坐标在几年内都有一个响应变量。以下代码生成类似的数据框:
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
结果数据框:
head(df)
id x y year response
1 1 25 100 1980 0.1707431
2 2 30 200 1981 1.3562263
3 1 25 100 1982 -0.4590506
4 2 30 200 1983 1.3238410
5 1 25 100 1984 1.7765772
6 2 30 200 1985 -0.6258069
我想 运行 随时间对每个单元格进行线性回归以获得响应变量的变化。为此,我使用 plyr 和 ddply 命令:
require(plyr)
lm.df <- ddply(df, .(id), function(z)coef(lm(response ~ year, data = z)))
然后我重新组合空间数据(这里的重新组合示例有点简单,但适用于示例。):
points <- data.frame(
id = c(1,2),
x = c(25,30),
y = c(100,200)
)
lm.stack <- merge(points, lm.df, by="id")
colnames(lm.stack) <- c("ID", "x", "y", "intercept", "slope")
print(lm.stack)
ID x y intercept slope
1 1 25 100 257.7291 -0.12985632
2 2 30 200 173.3676 -0.08708068
效果很好。但我想要的是能够提取每个 lm 模型的 adj r 平方值,以便能够通过坐标单元表示显着趋势,最终目标是投影按显着性值着色的单元格图。这是我需要帮助的地方,我非常感激。谢谢!
如果您将模型保存在列表中,那么您可以使用 lapply
或任何其他列表遍历函数轻松地从中提取您想要的任何内容。
lms <- dlply(df, .(id), function(x) lm(response ~ year, data=x))
rsq <- lapply(lms, function(x) summary(x)$adj.r.squared) # get the adjusted r2
您自己制作列表的另一种方法是 lmList
来自 nlme
library(nlme)
lms2 <- lmList(response ~ year | id, data=df)
summary(lms2)$adj.r.squared
我建议组合使用包 dplyr
和 broom
。
这种方法将 return 您需要了解的有关模型的所有信息(有关系数的信息和有关模型本身的信息)作为数据框:
set.seed(25)
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
df
# id x y year response
# 1 1 25 100 1980 -0.21183360
# 2 2 30 200 1981 -1.04159113
# 3 1 25 100 1982 -1.15330756
# 4 2 30 200 1983 0.32153150
# 5 1 25 100 1984 -1.50012988
# 6 2 30 200 1985 -0.44553326
# 7 1 25 100 1986 1.73404543
# 8 2 30 200 1987 0.51129562
# 9 1 25 100 1988 0.09964504
# 10 2 30 200 1989 -0.05789111
# 11 1 25 100 1980 -1.74278763
# 12 2 30 200 1981 -1.32495298
# 13 1 25 100 1982 -0.54793388
# 14 2 30 200 1983 -1.45638428
# 15 1 25 100 1984 0.08268682
# 16 2 30 200 1985 0.92757895
# 17 1 25 100 1986 -0.71676933
# 18 2 30 200 1987 0.96239968
# 19 1 25 100 1988 1.54588458
# 20 2 30 200 1989 -1.00976361
library(dplyr)
library(broom)
df %>%
group_by(id) %>%
do({model = lm(response~year, data=.) # create your model
data.frame(tidy(model), # get coefficient info
glance(model))}) # get model info
# id term estimate std.error statistic p.value r.squared adj.r.squared sigma statistic.1 p.value.1 df logLik AIC BIC deviance df.residual
# 1 1 (Intercept) -492.2144842 213.19252113 -2.308779 0.04978386 0.3996362 0.32459069 0.9611139 5.325253 0.04987162 2 -12.67704 31.35409 32.26184 7.389919 8
# 2 1 year 0.2479705 0.10745580 2.307651 0.04987162 0.3996362 0.32459069 0.9611139 5.325253 0.04987162 2 -12.67704 31.35409 32.26184 7.389919 8
# 3 2 (Intercept) -258.6253012 196.88284243 -1.313600 0.22539989 0.1771294 0.07427055 0.8871395 1.722063 0.22582607 2 -11.87614 29.75227 30.66003 6.296132 8
# 4 2 year 0.1301582 0.09918521 1.312274 0.22582607 0.1771294 0.07427055 0.8871395 1.722063 0.22582607 2 -11.87614 29.75227 30.66003 6.296132 8
如果您真的想要最终数据框中的 x 和 y 列,则可以使用 group_by(id,x,y)
。例如,如果您想要与您提供的输出类似的输出,您可以这样做:
library(dplyr)
library(broom)
library(tidyr)
dd =
df %>%
group_by(id, x, y) %>%
do({model = lm(response~year, data=.) # create your model
data.frame(tidy(model), # get coefficient info
glance(model))}) # get model info
然后select您需要的信息:
dd %>%
select(id, x, y, term, estimate, adj.r.squared)
# id x y term estimate adj.r.squared
# 1 1 25 100 (Intercept) -492.2144842 0.32459069
# 2 1 25 100 year 0.2479705 0.32459069
# 3 2 30 200 (Intercept) -258.6253012 0.07427055
# 4 2 30 200 year 0.1301582 0.07427055
截距一行,斜率一行。
或者,甚至将该数据框重塑为:
dd %>%
select(id, x, y, term, estimate, adj.r.squared) %>%
spread(term, estimate)
# id x y adj.r.squared (Intercept) year
# 1 1 25 100 0.32459069 -492.2145 0.2479705
# 2 2 30 200 0.07427055 -258.6253 0.1301582
列year
是slope
(变量系数year
)
以类似的方式,您还可以使用包 purrr
创建一个新的数据框,其中包含带有相应信息的列表列:
set.seed(25)
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
dd = df %>%
group_by(id, x, y) %>% # for each combination of those variables
nest() %>% # nest data (rest of columns)
mutate(Model = map(data, ~lm(response~year, data=.)), # use nested data to build model of interest
Coeff_Info = map(Model, tidy), # get coefficient info
Model_Info = map(Model, glance)) %>% # get model info
ungroup() # forget the grouping
# check how new dataset looks like
dd
# # A tibble: 2 x 7
# id x y data Model Coeff_Info Model_Info
# <int> <dbl> <dbl> <list> <list> <list> <list>
# 1 1 25 100 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
# 2 2 30 200 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
您仍然可以访问任何您想要的元素,但请记住,您的一些列现在有列表元素:
# get coefficient info for both models
dd$Coeff_Info
# [[1]]
# term estimate std.error statistic p.value
# 1 (Intercept) -492.2144842 213.1925211 -2.308779 0.04978386
# 2 year 0.2479705 0.1074558 2.307651 0.04987162
#
# [[2]]
# term estimate std.error statistic p.value
# 1 (Intercept) -258.6253012 196.88284243 -1.313600 0.2253999
# 2 year 0.1301582 0.09918521 1.312274 0.2258261
# get the r squared value for 1st model
dd %>% # from new dataset
filter(id == 1) %>% # keep all info / rows of 1st model
pull(Model_Info) %>% # get model info
map_dbl("r.squared") # show r squared
# [1] 0.3996362
或者,或者,
dd %>% # from new dataset
unnest(id, Model_Info) %>% # umnest id and model info
filter(id == 1) %>% # keep row of 1st model
pull(r.squared) # show the r squared
# [1] 0.3996362
我有一组空间坐标 (x,y) 数据,其中每个坐标在几年内都有一个响应变量。以下代码生成类似的数据框:
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
结果数据框:
head(df)
id x y year response
1 1 25 100 1980 0.1707431
2 2 30 200 1981 1.3562263
3 1 25 100 1982 -0.4590506
4 2 30 200 1983 1.3238410
5 1 25 100 1984 1.7765772
6 2 30 200 1985 -0.6258069
我想 运行 随时间对每个单元格进行线性回归以获得响应变量的变化。为此,我使用 plyr 和 ddply 命令:
require(plyr)
lm.df <- ddply(df, .(id), function(z)coef(lm(response ~ year, data = z)))
然后我重新组合空间数据(这里的重新组合示例有点简单,但适用于示例。):
points <- data.frame(
id = c(1,2),
x = c(25,30),
y = c(100,200)
)
lm.stack <- merge(points, lm.df, by="id")
colnames(lm.stack) <- c("ID", "x", "y", "intercept", "slope")
print(lm.stack)
ID x y intercept slope
1 1 25 100 257.7291 -0.12985632
2 2 30 200 173.3676 -0.08708068
效果很好。但我想要的是能够提取每个 lm 模型的 adj r 平方值,以便能够通过坐标单元表示显着趋势,最终目标是投影按显着性值着色的单元格图。这是我需要帮助的地方,我非常感激。谢谢!
如果您将模型保存在列表中,那么您可以使用 lapply
或任何其他列表遍历函数轻松地从中提取您想要的任何内容。
lms <- dlply(df, .(id), function(x) lm(response ~ year, data=x))
rsq <- lapply(lms, function(x) summary(x)$adj.r.squared) # get the adjusted r2
您自己制作列表的另一种方法是 lmList
来自 nlme
library(nlme)
lms2 <- lmList(response ~ year | id, data=df)
summary(lms2)$adj.r.squared
我建议组合使用包 dplyr
和 broom
。
这种方法将 return 您需要了解的有关模型的所有信息(有关系数的信息和有关模型本身的信息)作为数据框:
set.seed(25)
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
df
# id x y year response
# 1 1 25 100 1980 -0.21183360
# 2 2 30 200 1981 -1.04159113
# 3 1 25 100 1982 -1.15330756
# 4 2 30 200 1983 0.32153150
# 5 1 25 100 1984 -1.50012988
# 6 2 30 200 1985 -0.44553326
# 7 1 25 100 1986 1.73404543
# 8 2 30 200 1987 0.51129562
# 9 1 25 100 1988 0.09964504
# 10 2 30 200 1989 -0.05789111
# 11 1 25 100 1980 -1.74278763
# 12 2 30 200 1981 -1.32495298
# 13 1 25 100 1982 -0.54793388
# 14 2 30 200 1983 -1.45638428
# 15 1 25 100 1984 0.08268682
# 16 2 30 200 1985 0.92757895
# 17 1 25 100 1986 -0.71676933
# 18 2 30 200 1987 0.96239968
# 19 1 25 100 1988 1.54588458
# 20 2 30 200 1989 -1.00976361
library(dplyr)
library(broom)
df %>%
group_by(id) %>%
do({model = lm(response~year, data=.) # create your model
data.frame(tidy(model), # get coefficient info
glance(model))}) # get model info
# id term estimate std.error statistic p.value r.squared adj.r.squared sigma statistic.1 p.value.1 df logLik AIC BIC deviance df.residual
# 1 1 (Intercept) -492.2144842 213.19252113 -2.308779 0.04978386 0.3996362 0.32459069 0.9611139 5.325253 0.04987162 2 -12.67704 31.35409 32.26184 7.389919 8
# 2 1 year 0.2479705 0.10745580 2.307651 0.04987162 0.3996362 0.32459069 0.9611139 5.325253 0.04987162 2 -12.67704 31.35409 32.26184 7.389919 8
# 3 2 (Intercept) -258.6253012 196.88284243 -1.313600 0.22539989 0.1771294 0.07427055 0.8871395 1.722063 0.22582607 2 -11.87614 29.75227 30.66003 6.296132 8
# 4 2 year 0.1301582 0.09918521 1.312274 0.22582607 0.1771294 0.07427055 0.8871395 1.722063 0.22582607 2 -11.87614 29.75227 30.66003 6.296132 8
如果您真的想要最终数据框中的 x 和 y 列,则可以使用 group_by(id,x,y)
。例如,如果您想要与您提供的输出类似的输出,您可以这样做:
library(dplyr)
library(broom)
library(tidyr)
dd =
df %>%
group_by(id, x, y) %>%
do({model = lm(response~year, data=.) # create your model
data.frame(tidy(model), # get coefficient info
glance(model))}) # get model info
然后select您需要的信息:
dd %>%
select(id, x, y, term, estimate, adj.r.squared)
# id x y term estimate adj.r.squared
# 1 1 25 100 (Intercept) -492.2144842 0.32459069
# 2 1 25 100 year 0.2479705 0.32459069
# 3 2 30 200 (Intercept) -258.6253012 0.07427055
# 4 2 30 200 year 0.1301582 0.07427055
截距一行,斜率一行。
或者,甚至将该数据框重塑为:
dd %>%
select(id, x, y, term, estimate, adj.r.squared) %>%
spread(term, estimate)
# id x y adj.r.squared (Intercept) year
# 1 1 25 100 0.32459069 -492.2145 0.2479705
# 2 2 30 200 0.07427055 -258.6253 0.1301582
列year
是slope
(变量系数year
)
以类似的方式,您还可以使用包 purrr
创建一个新的数据框,其中包含带有相应信息的列表列:
set.seed(25)
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
dd = df %>%
group_by(id, x, y) %>% # for each combination of those variables
nest() %>% # nest data (rest of columns)
mutate(Model = map(data, ~lm(response~year, data=.)), # use nested data to build model of interest
Coeff_Info = map(Model, tidy), # get coefficient info
Model_Info = map(Model, glance)) %>% # get model info
ungroup() # forget the grouping
# check how new dataset looks like
dd
# # A tibble: 2 x 7
# id x y data Model Coeff_Info Model_Info
# <int> <dbl> <dbl> <list> <list> <list> <list>
# 1 1 25 100 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
# 2 2 30 200 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
您仍然可以访问任何您想要的元素,但请记住,您的一些列现在有列表元素:
# get coefficient info for both models
dd$Coeff_Info
# [[1]]
# term estimate std.error statistic p.value
# 1 (Intercept) -492.2144842 213.1925211 -2.308779 0.04978386
# 2 year 0.2479705 0.1074558 2.307651 0.04987162
#
# [[2]]
# term estimate std.error statistic p.value
# 1 (Intercept) -258.6253012 196.88284243 -1.313600 0.2253999
# 2 year 0.1301582 0.09918521 1.312274 0.2258261
# get the r squared value for 1st model
dd %>% # from new dataset
filter(id == 1) %>% # keep all info / rows of 1st model
pull(Model_Info) %>% # get model info
map_dbl("r.squared") # show r squared
# [1] 0.3996362
或者,或者,
dd %>% # from new dataset
unnest(id, Model_Info) %>% # umnest id and model info
filter(id == 1) %>% # keep row of 1st model
pull(r.squared) # show the r squared
# [1] 0.3996362