Calculate 5-year averages of panel data while keeping factor columns
I have a panel data set that looks something like this:
set.seed(123)
df <- data.frame(
  year    = rep(2011:2020, 5),
  county  = rep(c("a", "b", "c", "d", "e"), each = 10),
  state   = rep(c("A", "B", "C", "D", "E"), each = 10),
  country = rep(c("AA", "BB", "CC", "DD", "EE"), each = 10),
  var1    = runif(50, 0, 50),
  var2    = runif(50, 50, 100)
)
I want to convert the panel data set into 5-year averages by county with
df <- df %>%
  mutate(period = cut(df$year, seq(2011, 2021, by = 5), right = F)) %>%
  group_by(county, period) %>%
  summarise_all(mean)
The resulting data set looks like
county period year state country var1 var2
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a [2011,2016) 2013 NA NA 33.1 69.7
2 a [2016,2021) 2018 NA NA 24.7 73.6
3 b [2011,2016) 2013 NA NA 27.6 72.3
4 b [2016,2021) 2018 NA NA 24.7 83.1
5 c [2011,2016) 2013 NA NA 38.7 75.7
6 c [2016,2021) 2018 NA NA 22.8 66.8
7 d [2011,2016) 2013 NA NA 33.8 72.2
8 d [2016,2021) 2018 NA NA 20.0 83.7
9 e [2011,2016) 2013 NA NA 14.9 71.0
10 e [2016,2021) 2018 NA NA 19.6 70.4
with warning messages such as
In mean.default(state) :
argument is not numeric or logical: returning NA
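Is there a neat way to keep the time-invariant columns for each county after the transformation? I would rather not merge them back in, because I have a lot of character columns.
What I want is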
county period year state country var1 var2
<chr> <fct> <dbl> <chr> <chr> <dbl> <dbl>
1 a [2011,2016) 2013 A AA 33.1 69.7
2 a [2016,2021) 2018 A AA 24.7 73.6
3 b [2011,2016) 2013 B BB 27.6 72.3
4 b [2016,2021) 2018 B BB 24.7 83.1
5 c [2011,2016) 2013 C CC 38.7 75.7
6 c [2016,2021) 2018 C CC 22.8 66.8
7 d [2011,2016) 2013 D DD 33.8 72.2
8 d [2016,2021) 2018 D DD 20.0 83.7
9 e [2011,2016) 2013 E EE 14.9 71.0
10 e [2016,2021) 2018 E EE 19.6 70.4
Thanks in advance!
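The warnings arise because summarise_all(mean) computes the mean not only of var1 and var2 but also of state and country, which are character columns. If you want to keep state and country as grouping columns, put them into group_by():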
library(dplyr)
df %>%
  group_by(county, state, country,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise_all(mean) %>%
  ungroup()
# # A tibble: 10 × 7
# county state country period year var1 var2
# <chr> <chr> <chr> <fct> <dbl> <dbl> <dbl>
# 1 a A AA [2011,2016) 2013 33.1 69.7
# 2 a A AA [2016,2021) 2018 24.7 73.6
# 3 b B BB [2011,2016) 2013 27.6 72.3
# 4 b B BB [2016,2021) 2018 24.7 83.1
# 5 c C CC [2011,2016) 2013 38.7 75.7
# 6 c C CC [2016,2021) 2018 22.8 66.8
# 7 d D DD [2011,2016) 2013 33.8 72.2
# 8 d D DD [2016,2021) 2018 20.0 83.7
# 9 e E EE [2011,2016) 2013 14.9 71.0
# 10 e E EE [2016,2021) 2018 19.6 70.4
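In current dplyr, summarise_all() is superseded by across(). The following is a minimal sketch of the same grouping approach written with across(), assuming dplyr 1.0.0 or later; the object name res is only illustrative, and the data frame is rebuilt so the block runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question so this block is self-contained.
set.seed(123)
df <- data.frame(
  year    = rep(2011:2020, 5),
  county  = rep(c("a", "b", "c", "d", "e"), each = 10),
  state   = rep(c("A", "B", "C", "D", "E"), each = 10),
  country = rep(c("AA", "BB", "CC", "DD", "EE"), each = 10),
  var1    = runif(50, 0, 50),
  var2    = runif(50, 50, 100)
)

# across(everything(), mean) skips the grouping columns, so only
# year, var1 and var2 are averaged; .groups = "drop" replaces ungroup().
res <- df %>%
  group_by(county, state, country,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise(across(everything(), mean), .groups = "drop")
```

Because state and country are part of the grouping key here, they are carried through unchanged rather than passed to mean().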
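If the grouping columns are only county and period, and the other categorical variables take a single value within each group, you can keep them in summarise() by taking their first value with first().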
df %>%
  group_by(county,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise(across(!where(is.numeric), first),
            across( where(is.numeric), mean)) %>%
  ungroup()
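The first() approach is only safe if each non-numeric column really is constant within every county/period group. One way to check that assumption before relying on it is sketched below, using n_distinct(); the names check and all_constant are illustrative and not part of the original answer, and the data frame is rebuilt so the block runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question so this block is self-contained.
set.seed(123)
df <- data.frame(
  year    = rep(2011:2020, 5),
  county  = rep(c("a", "b", "c", "d", "e"), each = 10),
  state   = rep(c("A", "B", "C", "D", "E"), each = 10),
  country = rep(c("AA", "BB", "CC", "DD", "EE"), each = 10),
  var1    = runif(50, 0, 50),
  var2    = runif(50, 50, 100)
)

# Count the distinct values of every non-numeric, non-grouping column
# within each county/period group.
check <- df %>%
  group_by(county,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise(across(!where(is.numeric), n_distinct), .groups = "drop")

# TRUE when every such column has exactly one value per group,
# meaning first() discards no information.
all_constant <- all(select(check, -county, -period) == 1)
```

If all_constant is FALSE, some column varies within a group, and first() would silently pick one of several values.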