Calculate 5-year averages of panel data while keeping factor columns

I have a panel dataset that looks something like this:

set.seed(123)
df <- data.frame(
  year = rep(2011:2020,5),
  county = rep(c("a","b",'c','d','e'), each=10),
  state = rep(c("A","B",'C','D','E'), each=10),
  country = rep(c("AA","BB",'CC','DD','EE'), each=10),
  var1 = runif(50, 0, 50),
  var2 = runif(50, 50, 100)
)

I want to convert the panel dataset to 5-year averages per county with:

library(dplyr)

df <- df %>%
  # refer to the column directly inside mutate() rather than via df$year
  mutate(period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  group_by(county, period) %>%
  summarise_all(mean)

The resulting dataset looks like this:

   county period       year state country  var1  var2
   <chr>  <fct>       <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 a      [2011,2016)  2013    NA      NA  33.1  69.7
 2 a      [2016,2021)  2018    NA      NA  24.7  73.6
 3 b      [2011,2016)  2013    NA      NA  27.6  72.3
 4 b      [2016,2021)  2018    NA      NA  24.7  83.1
 5 c      [2011,2016)  2013    NA      NA  38.7  75.7
 6 c      [2016,2021)  2018    NA      NA  22.8  66.8
 7 d      [2011,2016)  2013    NA      NA  33.8  72.2
 8 d      [2016,2021)  2018    NA      NA  20.0  83.7
 9 e      [2011,2016)  2013    NA      NA  14.9  71.0
10 e      [2016,2021)  2018    NA      NA  19.6  70.4

along with warning messages such as:

In mean.default(state) :
  argument is not numeric or logical: returning NA

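Is there a neat way (not a merge, really, since I have many character columns) to keep each county's time-invariant columns after the transformation? What I want is: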

   county period       year state country  var1  var2
   <chr>  <fct>       <dbl> <chr> <chr>   <dbl> <dbl>
 1 a      [2011,2016)  2013    A      AA  33.1  69.7
 2 a      [2016,2021)  2018    A      AA  24.7  73.6
 3 b      [2011,2016)  2013    B      BB  27.6  72.3
 4 b      [2016,2021)  2018    B      BB  24.7  83.1
 5 c      [2011,2016)  2013    C      CC  38.7  75.7
 6 c      [2016,2021)  2018    C      CC  22.8  66.8
 7 d      [2011,2016)  2013    D      DD  33.8  72.2
 8 d      [2016,2021)  2018    D      DD  20.0  83.7
 9 e      [2011,2016)  2013    E      EE  14.9  71.0
10 e      [2016,2021)  2018    E      EE  19.6  70.4

Thanks in advance!

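The warning arises because summarise_all(mean) computes the mean not only of var1 and var2 but also of state and country, which are character columns. If you want to keep state and country, treat them as grouping columns and put them into group_by():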

library(dplyr)

df %>%
  group_by(county, state, country,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise_all(mean) %>%
  ungroup()

# # A tibble: 10 × 7
#    county state country period       year  var1  var2
#    <chr>  <chr> <chr>   <fct>       <dbl> <dbl> <dbl>
#  1 a      A     AA      [2011,2016)  2013  33.1  69.7
#  2 a      A     AA      [2016,2021)  2018  24.7  73.6
#  3 b      B     BB      [2011,2016)  2013  27.6  72.3
#  4 b      B     BB      [2016,2021)  2018  24.7  83.1
#  5 c      C     CC      [2011,2016)  2013  38.7  75.7
#  6 c      C     CC      [2016,2021)  2018  22.8  66.8
#  7 d      D     DD      [2011,2016)  2013  33.8  72.2
#  8 d      D     DD      [2016,2021)  2018  20.0  83.7
#  9 e      E     EE      [2011,2016)  2013  14.9  71.0
# 10 e      E     EE      [2016,2021)  2018  19.6  70.4
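
For reference, the same grouping idea can be written with across(), since the dplyr documentation marks summarise_all() as superseded. This is only a sketch: it assumes dplyr 1.0.0 or later, rebuilds the toy data from the question so that it runs on its own, and stores the result in res, a name chosen here purely for illustration.

```r
library(dplyr)

# Same toy data as in the question
set.seed(123)
df <- data.frame(
  year = rep(2011:2020, 5),
  county = rep(c("a", "b", "c", "d", "e"), each = 10),
  state = rep(c("A", "B", "C", "D", "E"), each = 10),
  country = rep(c("AA", "BB", "CC", "DD", "EE"), each = 10),
  var1 = runif(50, 0, 50),
  var2 = runif(50, 50, 100)
)

res <- df %>%
  group_by(county, state, country,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  # across(everything(), ...) skips the grouping columns,
  # so only year, var1 and var2 are averaged
  summarise(across(everything(), mean), .groups = "drop")
```

Here .groups = "drop" takes the place of the trailing ungroup().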

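If the grouping columns are only county and period, and the other categorical variables take a single value within each group, you can keep them in summarise() by taking their first value with first():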

df %>%
  group_by(county,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise(across(!where(is.numeric), first),
            across( where(is.numeric), mean)) %>%
  ungroup()
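
The first() approach silently picks one value if a categorical column is not actually constant within a group, so it may be worth checking that assumption before relying on it. The following is a minimal sketch of such a check, assuming dplyr 1.0.0 or later; n_unique is a name introduced here for illustration, and the toy data from the question is rebuilt so the block runs on its own.

```r
library(dplyr)

# Same toy data as in the question
set.seed(123)
df <- data.frame(
  year = rep(2011:2020, 5),
  county = rep(c("a", "b", "c", "d", "e"), each = 10),
  state = rep(c("A", "B", "C", "D", "E"), each = 10),
  country = rep(c("AA", "BB", "CC", "DD", "EE"), each = 10),
  var1 = runif(50, 0, 50),
  var2 = runif(50, 50, 100)
)

# Count the distinct values of every non-numeric, non-grouping column
# within each county/period group
n_unique <- df %>%
  group_by(county,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise(across(!where(is.numeric), n_distinct), .groups = "drop") %>%
  select(-county, -period)

# Stops with an error if any column varies within a group
stopifnot(all(n_unique == 1))
```

If the check fails, adding the offending columns to group_by() instead is the safer route.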