Reshaping the data into a panel by summing up if a certain observation appears more than once?

I have data (df_input) on the year in which schools were established (year_est) in four different villages (A, B, C, D).

df_input <- data.frame( school_id= c(1,2,3,4,5,6), village= c("A","B","B","C","D","D"), year_est = c(2002,2002,2004,2001,2004,2004))    

df_output <- data.frame(year= c(2001,2002,2003,2004,2001,2002,2003,2004,2001,2002,2003,2004,2001,2002,2003,2004), 
                        village = c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"), 
                        school_est=c(0,1,1,1,0,1,1,2,1,1,1,1,0,0,0,2))
                        

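I am trying to reshape the data into the format of df_output, where the variable "school_est" takes the value 1 from the year a school is established in the village onwards, and stays 0 otherwise.

In addition, if more than one school has been established in a particular village, school_est can take a value greater than 1, for example village B in 2004 in df_output.

It is also common in my dataset for more than one school to be established in the same village in the same year, as is the case for village D in 2004. So in df_output, school_est goes from 0 in 2003 to 2 in 2004.

Can anyone help me with this?

I am using the following code to generate df_output: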

df_panel <- df_input %>%
  merge(expand.grid(year=2001:2004, Village=.$Village), by="Village") %>% 
  mutate(across(year_est, ~ as.numeric(replace_na(.x <= year, 0))))

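We can use complete from tidyr: count the schools per village and year, fill in the missing village-year combinations with 0, then take the cumulative sum within each village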

library(dplyr)
library(tidyr)
df_input %>% 
   count(village, year = year_est, name = 'school_est') %>% 
   complete(village, year  = min(year):max(year), 
       fill = list(school_est = 0)) %>% 
   mutate(school_est = ave(school_est, village, FUN = cumsum))

-output

 # A tibble: 16 x 3
   village  year school_est
   <chr>   <dbl>      <dbl>
 1 A        2001          0
 2 A        2002          1
 3 A        2003          1
 4 A        2004          1
 5 B        2001          0
 6 B        2002          1
 7 B        2003          1
 8 B        2004          2
 9 C        2001          1
10 C        2002          1
11 C        2003          1
12 C        2004          1
13 D        2001          0
14 D        2002          0
15 D        2003          0
16 D        2004          2
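
The pipeline above can also be read as three separate steps. The following is only a sketch of that reading: the intermediate names counts, filled and panel are made up for illustration, and group_by() with cumsum() is swapped in for the ave() call.

```r
library(dplyr)
library(tidyr)

df_input <- data.frame(
  school_id = c(1, 2, 3, 4, 5, 6),
  village   = c("A", "B", "B", "C", "D", "D"),
  year_est  = c(2002, 2002, 2004, 2001, 2004, 2004)
)

# Step 1: number of schools founded per village and year
# (only combinations that actually occur in the data)
counts <- count(df_input, village, year = year_est, name = "school_est")

# Step 2: add the missing village-year combinations with a count of 0
filled <- complete(counts, village, year = min(year):max(year),
                   fill = list(school_est = 0L))

# Step 3: running total of schools within each village
# (group_by + cumsum used here in place of ave)
panel <- filled %>%
  group_by(village) %>%
  mutate(school_est = cumsum(school_est)) %>%
  ungroup()

panel
```

Inspecting counts and filled on their own makes it easier to see where the zeros come from before the cumulative sum is applied.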

Or using base R

out <- transform(as.data.frame(table(transform(df_input, 
   year_est = factor(year_est, levels = min(year_est):max(year_est)))[-1])), 
     Freq = ave(Freq, village, FUN = cumsum))
out[order(out$village),]
   village year_est Freq
1        A     2001    0
5        A     2002    1
9        A     2003    1
13       A     2004    1
2        B     2001    0
6        B     2002    1
10       B     2003    1
14       B     2004    2
3        C     2001    1
7        C     2002    1
11       C     2003    1
15       C     2004    1
4        D     2001    0
8        D     2002    0
12       D     2003    0
16       D     2004    2
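
The nested base R call above is fairly dense, so here is an unpacked sketch of the same idea. The names years, tab, cum and out are illustrative, and the output columns are renamed to village, year and school_est, which the one-liner above does not do.

```r
df_input <- data.frame(
  school_id = c(1, 2, 3, 4, 5, 6),
  village   = c("A", "B", "B", "C", "D", "D"),
  year_est  = c(2002, 2002, 2004, 2001, 2004, 2004)
)

# All years in the observed range, including years with no new school
years <- min(df_input$year_est):max(df_input$year_est)

# Village x year table of newly established schools;
# the factor levels keep the years in which nothing was founded
tab <- table(df_input$village, factor(df_input$year_est, levels = years))

# Cumulative sum along each row (village); apply() returns the result
# as years x villages, so transpose back to villages x years
cum <- t(apply(tab, 1, cumsum))

# Back to long format, one row per village and year
out <- as.data.frame(as.table(cum), stringsAsFactors = FALSE)
names(out) <- c("village", "year", "school_est")
out$year <- as.integer(out$year)
out <- out[order(out$village, out$year), ]
rownames(out) <- NULL

out
```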

This took longer than expected

library(tidyverse)

df_input <- data.frame( school_id= c(1,2,3,4,5,6), village= c("A","B","B","C","D","D"), year_est = c(2002,2002,2004,2001,2004,2004))    


df_input %>%
  group_by(village, year_est) %>%
  summarise(school_est = n(), .groups = 'drop') %>% 
  complete(nesting(village), year_est = seq(min(year_est), max(year_est),1), fill = list(school_est = 0)) %>%
  group_by(village) %>%
  mutate(school_est = cumsum(school_est)) %>%
  ungroup()

#> # A tibble: 16 x 3
#>    village year_est school_est
#>    <chr>      <dbl>      <dbl>
#>  1 A           2001          0
#>  2 A           2002          1
#>  3 A           2003          1
#>  4 A           2004          1
#>  5 B           2001          0
#>  6 B           2002          1
#>  7 B           2003          1
#>  8 B           2004          2
#>  9 C           2001          1
#> 10 C           2002          1
#> 11 C           2003          1
#> 12 C           2004          1
#> 13 D           2001          0
#> 14 D           2002          0
#> 15 D           2003          0
#> 16 D           2004          2
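
As a cross-check on the result, the target variable can also be computed straight from its definition: for each village and year, count the schools whose year_est is not later than that year. This is only a sketch in base R; the names grid and count_by are hypothetical, and the approach compares every grid row with every school, so it is meant for verification on small data rather than as a replacement for the code above.

```r
df_input <- data.frame(
  school_id = c(1, 2, 3, 4, 5, 6),
  village   = c("A", "B", "B", "C", "D", "D"),
  year_est  = c(2002, 2002, 2004, 2001, 2004, 2004)
)

# Every village-year combination; year varies fastest,
# so rows are ordered by village and then by year
grid <- expand.grid(
  year    = min(df_input$year_est):max(df_input$year_est),
  village = sort(unique(df_input$village)),
  stringsAsFactors = FALSE
)

# Number of schools in village v established in or before year y
count_by <- function(v, y) {
  sum(df_input$village == v & df_input$year_est <= y)
}

grid$school_est <- mapply(count_by, grid$village, grid$year,
                          USE.NAMES = FALSE)

grid
```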

Created on 2021-06-28 by the reprex package (v2.0.0)