Calculating slope and associated stats by rows using multiple categories in a data frame that has NAs
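I looked at the related question posted under "How can I calculate the slope of multiple subsets of a data frame more efficiently?", but as a beginner I am not allowed to comment directly on that thread (and I am not sure how to), so I am asking here:
How can I use the dplyr and broom solution to calculate slopes across multiple categories of a data frame while handling the NAs in the dataset? Here is an example of the script and its results.
Example data: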
DOY<-c(102,102,102,102,102,102,102,102,102,102,212,212,212,212,212,212, 212,212,212,212)
LOCATION <- c(1,1,1,1,1,2,2,2,2,2,1,1,1,1,1,3,3,3,3,3)
response <-c(NA,NA,NA,NA,NA,7,10,15,20,30,2,4,6,NA,8,10,15,20,30,NA)
ts <- c(0,10,20,30, 40,0,10,20,30,40,0,10,20,30,40,0,10,20,30,40)
test.data <- data.frame(cbind(DOY, LOCATION, response, ts))
library(dplyr)
library(broom)
test.data2 <- test.data %>% group_by(DOY) %>% do(tidy(lm(response ~ ts, data = .)))
test.data2 %>% filter(term == "ts")
The result works with a single grouping variable (because every group has enough non-NA data):
# A tibble: 2 x 6
# Groups: DOY [2]
# DOY term estimate std.error statistic p.value
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 102. ts 0.560 0.0721 7.77 0.00444
# 2 212. ts 0.278 0.247 1.13 0.303
But it does not work when grouping by multiple categories:
test.dataX <- test.data %>% group_by(LOCATION, DOY) %>% do(tidy(lm(response ~ ts, data = .)))
Resulting error:
# Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# 0 (non-NA) cases
test.dataX %>% filter(term == "ts")
# Error in eval(lhs, parent, parent) : object 'test.dataX' not found
Attempt 2: I tried na.omit in lm(), but that did not work either:
test.dataX <- test.data %>% group_by(LOCATION, DOY) %>% do(tidy(lm(response ~ ts, data = ., na.action=na.omit)))
# Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# 0 (non-NA) cases
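A quick check of how many usable observations each group contains may help explain the error. The sketch below rebuilds the example data and counts the non-NA responses per DOY and LOCATION combination; the names n_usable and n_obs are made up for illustration. A group with zero usable rows leaves lm() nothing to fit, whichever na.action is chosen.

```r
library(dplyr)

# Rebuild the example data from the question so this block runs on its own.
test.data <- data.frame(
  DOY      = rep(c(102, 212), each = 10),
  LOCATION = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
  response = c(NA, NA, NA, NA, NA, 7, 10, 15, 20, 30, 2, 4, 6, NA, 8, 10, 15, 20, 30, NA),
  ts       = rep(c(0, 10, 20, 30, 40), times = 4)
)

# Count the non-NA responses in each DOY x LOCATION group.
# n_usable and n_obs are illustrative names, not part of the original script.
n_usable <- test.data %>%
  group_by(DOY, LOCATION) %>%
  summarise(n_obs = sum(!is.na(response)), .groups = "drop")

n_usable
```

Here the group DOY 102, LOCATION 1 has no non-NA responses at all, which matches the "0 (non-NA) cases" message.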
Ideally, I would like something like this (together with R2 if possible; how can I add it to the output above?):
# DOY LOCATION slope R2
# 102 1 NA NA
# 102 2 0.560 0.953
# 212 1 0.149 0.966
# 212 3 0.650 0.966
########################
Any advice would be appreciated. Thanks!
If we want NA returned for groups where the model cannot be fit, then use possibly from purrr:
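library(tidyverse)
library(broom)

# Wrap lm() and pull() so that a failure yields a default instead of an error
pos1 <- possibly(lm, otherwise = NULL)
prsq <- possibly(pull, otherwise = NA)

test.data %>%
  group_by(DOY, LOCATION) %>%
  nest() %>%
  mutate(
    # Fit one model per group; groups that cannot be fit get NULL
    model = map(data, ~ pos1(response ~ ts, data = .x)),
    # Slope of ts, or NA when no model is available
    slope = map_dbl(model, ~
      .x %>%
        tidy() %>%
        filter(term == 'ts') %>%
        prsq(estimate)),
    # R-squared, or NA when no model is available
    R2 = map_dbl(model, ~
      .x %>%
        glance() %>%
        prsq(r.squared))
  ) %>%
  select(-data, -model)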
# A tibble: 4 x 4
# DOY LOCATION slope R2
# <dbl> <dbl> <dbl> <dbl>
#1 102 1 NA NA
#2 102 2 0.56 0.953
#3 212 1 0.149 0.966
#4 212 3 0.650 0.966
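As a possible alternative to catching errors, the groups can be checked before fitting. This is only a sketch under a few assumptions: fit_slope is a hypothetical helper written for this example, it treats fewer than two complete (response, ts) pairs as "cannot fit", and it relies on group_modify() from dplyr rather than the nest/possibly approach shown above.

```r
library(dplyr)

# Rebuild the example data from the question so this block runs on its own.
test.data <- data.frame(
  DOY      = rep(c(102, 212), each = 10),
  LOCATION = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
  response = c(NA, NA, NA, NA, NA, 7, 10, 15, 20, 30, 2, 4, 6, NA, 8, 10, 15, 20, 30, NA),
  ts       = rep(c(0, 10, 20, 30, 40), times = 4)
)

# Hypothetical helper: returns a one-row data frame with the slope and R2,
# or NA for both when fewer than two complete (response, ts) pairs exist.
fit_slope <- function(d) {
  ok <- complete.cases(d$response, d$ts)
  if (sum(ok) < 2) {
    return(data.frame(slope = NA_real_, R2 = NA_real_))
  }
  fit <- lm(response ~ ts, data = d)  # lm() drops the NA rows by default
  data.frame(
    slope = coef(fit)[["ts"]],
    R2    = summary(fit)$r.squared
  )
}

# Apply the helper to every DOY x LOCATION group.
result <- test.data %>%
  group_by(DOY, LOCATION) %>%
  group_modify(~ fit_slope(.x)) %>%
  ungroup()

result
```

The explicit check makes the reason for each NA visible in the code, whereas possibly will also hide unrelated errors; which trade-off is preferable depends on the data.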