How to create a cumulative dropout rate table from raw data

Question

I am trying to adapt the solution posted here.

I want to create a cumulative dropout rate table from this data.

library(data.table)

DT <- data.table(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
         11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
         21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35),
  year = c(2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
           2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016),
  cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3))

So far, I have been able to do this:

library(tidyverse)

DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = `2014` - `2015`,
         y2015_2016_dropouts = `2015` - `2016`) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts / `2014`,
         y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts,
         y2014_2015_cumulative, y2015_2016_cumulative)

A cumulative dropout rate table shows the proportion of students in a class (cohort) who have dropped out over the years.

# A tibble: 3 x 5
  cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
   <dbl>               <dbl>               <dbl>                 <dbl>                 <dbl>
1      1                   6                   2                   0.6                   0.8
2      2                   0                   2                  NA                    NA  
3      3                   0                   0                  NA                    NA  

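The last two columns of the tibble show that by the end of 2014-2015, 60% of the students in cohort 1 had dropped out, and by the end of 2015-2016, 80% of them had dropped out.

I want to do the same calculation for cohorts 2 and 3, but I don't know how.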

Answer 1

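Because you spread the data by year early in the pipeline, the `2014` column is NA for everything related to cohort 2, so you need to coalesce the denominator in the calculation of y2015_2016_cumulative. If you change the definition of that variable from the current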

y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative

to

y2015_2016_cumulative = y2015_2016_dropouts / coalesce(`2014`, `2015`) +
  coalesce(y2014_2015_cumulative, 0)
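
As a rough check of what this replacement does, here is a minimal sketch that applies the same formula to the per-cohort head counts from the question's data. The vector names n2014, n2015, n2016 and denominator are illustrative only and do not appear in the original pipeline.

```r
library(dplyr)

# Head counts per cohort (elements: cohorts 1, 2, 3), read off the question's data.
# NA means the cohort has no students recorded in that year.
n2014 <- c(10, NA, NA)
n2015 <- c(4, 6, NA)
n2016 <- c(2, 4, 9)

# coalesce() returns, element by element, the first value that is not NA.
denominator <- coalesce(n2014, n2015)   # 10, 6, NA

y2014_2015_cumulative <- (n2014 - n2015) / n2014
y2015_2016_cumulative <- (n2015 - n2016) / denominator +
  coalesce(y2014_2015_cumulative, 0)

# Cohorts 1 and 2 now get a value; cohort 3 is still NA because
# both its 2014 and 2015 counts are missing.
print(y2015_2016_cumulative)
```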

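you should be good to go. The coalesce function tries the first argument and, if that is NA, falls back to the second. That said, the current approach is not very scalable: you would have to add another coalesce call for every year you add, and cohort 3, which has no count in either 2014 or 2015, still ends up with NA. If you keep your data in a tidy format, you can keep a running tally at the year-cohort level with: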

DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  group_by(cohort) %>%
  # dropouts: change in head count from the cohort's previous year
  mutate(dropouts = lag(n) - n,
         dropout_rate = dropouts / max(n)) %>%
  # the first year of each cohort has no previous year, so lag() gives NA there
  replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
  mutate(cumulative_dropouts = cumsum(dropouts),
         cumulative_dropout_rate = cumulative_dropouts / max(n))
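
The long result can also be reshaped into one row per cohort, closer to the layout in the question. The following is only a sketch: it rebuilds the question's data in compact form, reuses the pipeline above, and assumes tidyr's pivot_wider(). The names rates and wide and the cum_ column prefix are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, written compactly.
DT <- tibble(
  id     = 1:35,
  year   = rep(c(2014, 2015, 2016), times = c(10, 10, 15)),
  cohort = c(rep(1, 10),
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
)

# Running tally per cohort and year, as in the pipeline above.
rates <- DT %>%
  count(cohort, year) %>%
  group_by(cohort) %>%
  mutate(dropouts = lag(n) - n) %>%
  replace_na(list(dropouts = 0)) %>%
  mutate(cumulative_dropout_rate = cumsum(dropouts) / max(n)) %>%
  ungroup()

# One row per cohort, one column per year; NA marks years before a cohort starts.
wide <- rates %>%
  select(cohort, year, cumulative_dropout_rate) %>%
  pivot_wider(names_from = year,
              values_from = cumulative_dropout_rate,
              names_prefix = "cum_")

print(wide)
```

Because the reshaping happens only at the end, adding another year of data adds another column without any change to the code.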

Answer 2

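Here is another solution, this time with data.table, that organizes your data in a way that I find easier to work with. Using your DT input data:

Count by cohort and year, then sort: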

DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]

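Assign year ranges, looking back within each cohort. shift() is data.table's own lag, so dplyr does not need to be loaded; year is converted to character first so that the labels can be written back by group:

DT2[, year := as.character(year)]
DT2[, year := paste(shift(year), year, sep = "_"), by = cohort]

Dropouts per year:

DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), by = cohort]

Cumulative sum of the yearly dropouts, as a proportion of each cohort's largest head count:

DT2[, cumul := cumsum(dropouts) / max(N), by = cohort]

Output:

> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000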
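
The year-range labels depend on whether the lag is taken over the whole table or within each cohort. A minimal sketch of the difference, using data.table's shift() on the cohort/year pairs from the table above; the table x and the column names ungrouped and grouped are made up for illustration:

```r
library(data.table)

# Cohort/year pairs as they appear in the summarised table.
x <- data.table(cohort = c(1, 1, 1, 2, 2, 3),
                year   = c(2014, 2015, 2016, 2015, 2016, 2016))

# Ungrouped: each row looks at the row above it, even across cohorts,
# so the first row of cohorts 2 and 3 picks up the previous cohort's last year.
x[, ungrouped := paste(shift(year), year, sep = "_")]

# Grouped: the first row of every cohort has no previous year and gets NA.
x[, grouped := paste(shift(year), year, sep = "_"), by = cohort]

print(x)
```

The NA_ prefix in the grouped column marks the first observed year of each cohort, which is the baseline row with zero dropouts.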
