How to compute regressions by group with lm, do, broom and dplyr?
Consider this simple example
> library(dplyr)
> library(broom)
> dataframe <- data_frame(id = c(1,2,3,4,5,6),
+                         group = c(1,1,1,2,2,2),
+                         value = c(200,400,120,300,100,100))
> dataframe
# A tibble: 6 x 3
     id group value
  <dbl> <dbl> <dbl>
1     1     1   200
2     2     1   400
3     3     1   120
4     4     2   300
5     5     2   100
6     6     2   100
Here I want to regress value on a constant, by group. I have this get_mean() function:
get_mean <- function(data, myvar){
  col_name <- as.character(substitute(myvar))
  fmla <- as.formula(paste(col_name, "~ 1"))
  tidy(lm(data = data, fmla)) %>% pull(estimate)
}
The naive approach:
dataframe %>% group_by(group) %>% mutate(bug = get_mean(., value),
                                         Ineedthis = max(value))
# A tibble: 6 x 5
# Groups:   group [2]
     id group value      bug Ineedthis
  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1     1     1   200 203.3333       400
2     2     1   400 203.3333       400
3     3     1   120 203.3333       400
4     4     2   300 203.3333       300
5     5     2   100 203.3333       300
6     6     2   100 203.3333       300
It fails because, as you can see, the mean is not computed by group.
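The likely reason is that inside mutate() the . placeholder still refers to the whole piped data frame rather than to the current group, so lm() is fitted on all six rows. A minimal base R sketch of the difference, assuming the same example data rebuilt with data.frame() (the names overall and by_group are only illustrative):

```r
# Rebuild the example data with base R only (assumption: same values as above).
dataframe <- data.frame(id    = 1:6,
                        group = c(1, 1, 1, 2, 2, 2),
                        value = c(200, 400, 120, 300, 100, 100))

# Intercept-only fit on all six rows: this is the number in the bug column.
overall <- coef(lm(value ~ 1, data = dataframe))[[1]]

# The same fit on each group's rows separately: this is what was wanted.
by_group <- sapply(split(dataframe, dataframe$group),
                   function(d) coef(lm(value ~ 1, data = d))[[1]])

overall   # about 203.3333
by_group  # about 240 for group 1 and 166.6667 for group 2
```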
As is well known, using do will work.
dataframe %>% group_by(group) %>% do(bug = get_mean(., value))
Source: local data frame [2 x 2]
Groups: <by row>
# A tibble: 2 x 2
  group       bug
* <dbl>    <list>
1     1 <dbl [1]>
2     2 <dbl [1]>
However, I don't know how to get the other variable, Ineedthis, using do, and I don't know how to unlist the bug variable. I would like my output to be:
# A tibble: 6 x 5
     id group value     good Ineedthis
  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1     1     1   200      240       400
2     2     1   400      240       400
3     3     1   120      240       400
4     4     2   300 166.6666       300
5     5     2   100 166.6666       300
6     6     2   100 166.6666       300
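Any ideas? Thanks!!
Here is a cool solution that reproduces the expected output. I am not sure it is the best solution, but it is still worth sharing with my fellow coding enthusiasts :)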
get_output <- function(dataframe){
  temp <- dataframe %>%
    group_by(group) %>%
    do({mymean = get_mean(., value);
        myother = max(.$value);
        dplyr::data_frame(mean = mymean,
                          other = myother)})
  dataframe %>% left_join(temp)
}
> get_output(dataframe)
Joining, by = "group"
# A tibble: 6 x 5
     id group value     mean other
  <dbl> <dbl> <dbl>    <dbl> <dbl>
1     1     1   200 240.0000   400
2     2     1   400 240.0000   400
3     3     1   120 240.0000   400
4     4     2   300 166.6667   300
5     5     2   100 166.6667   300
6     6     2   100 166.6667   300
I made some changes to your get_mean function, but it is functionally the same. See:
get_mean <- function(., myvar){
  dat <- substitute(myvar) %>% data.frame(.) %>% setNames('vec')
  out <- lm(data = dat,'vec ~ 1')$coefficients[1] %>% unname(.)
  return(out)
}
This allows us to do:
dataframe %>%
group_by(group) %>%
summarise(good = get_mean(., value), Ineedthis= max(value)) %>%
left_join(dataframe, ., by = 'group')
Which results in:
  id group value     good Ineedthis
1  1     1   200 240.0000       400
2  2     1   400 240.0000       400
3  3     1   120 240.0000       400
4  4     2   300 166.6667       300
5  5     2   100 166.6667       300
6  6     2   100 166.6667       300
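Another possible route, offered as a sketch rather than a drop-in replacement for the answers above: if the helper takes the column values instead of the data frame, a grouped mutate() can produce both columns without do or a join. Here get_mean_vec is a hypothetical name introduced for illustration, and tibble() is assumed in place of data_frame().

```r
library(dplyr)

# Same example data as in the question, built with tibble() (assumption).
dataframe <- tibble(id    = c(1, 2, 3, 4, 5, 6),
                    group = c(1, 1, 1, 2, 2, 2),
                    value = c(200, 400, 120, 300, 100, 100))

# Hypothetical helper: intercept of an intercept-only regression on a vector.
get_mean_vec <- function(x) {
  unname(coef(lm(x ~ 1))[1])
}

# Inside a grouped mutate(), value is the current group's slice of the column,
# so the regression is fitted once per group and recycled to that group's rows.
result <- dataframe %>%
  group_by(group) %>%
  mutate(good      = get_mean_vec(value),
         Ineedthis = max(value)) %>%
  ungroup()

result
```

Because the helper only sees a vector, it never depends on what . is bound to in the pipe, which is the part that went wrong in the naive attempt.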