How to calculate number and duration of categorical spells by ID in R

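I have a longitudinal dataset that records a person's employment status every month for 45 months. I would like to create two variables to add to this dataset: 1) the total time each person spent "Unemployed", and 2) the number of unemployment spells.

Ideally, it would also skip over NAs without interrupting a spell.

I have created a sample dataset to keep things simple: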


    ID <- c(1:10, 1:10, 1:10)
    date <- c("2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", 
              "2006-09-01", "2006-09-01", "2006-09-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", 
              "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-11-01", 
              "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", 
              "2006-11-01", "2006-11-01")
    act <- c("Unemployed", "Employment", "Education", "Education", "Education", "Education", "Education", 
             "Education", "Education", "Unemployed", "Education", "Unemployed", "Unemployed", "Unemployed", 
             "Education", "Education", "Employment", "Education", "Education", "NA", "Unemployed", 
             "Unemployed", "NA", "Unemployed", "Education", "Employment", "Employment", "NA", "Education", 
             "Unemployed")
    df <- data.frame(ID, date, act)
    df[order(ID),]

       ID       date        act
    1   1 2006-09-01 Unemployed
    11  1 2006-10-01  Education
    21  1 2006-11-01 Unemployed
    2   2 2006-09-01 Employment
    12  2 2006-10-01 Unemployed
    22  2 2006-11-01 Unemployed
    3   3 2006-09-01  Education
    13  3 2006-10-01 Unemployed
    23  3 2006-11-01         NA
    4   4 2006-09-01  Education
    14  4 2006-10-01 Unemployed
    24  4 2006-11-01 Unemployed
    5   5 2006-09-01  Education
    15  5 2006-10-01  Education
    25  5 2006-11-01  Education
    6   6 2006-09-01  Education
    16  6 2006-10-01  Education
    26  6 2006-11-01 Employment
    7   7 2006-09-01  Education
    17  7 2006-10-01 Employment
    27  7 2006-11-01 Employment
    8   8 2006-09-01  Education
    18  8 2006-10-01  Education
    28  8 2006-11-01         NA
    9   9 2006-09-01  Education
    19  9 2006-10-01  Education
    29  9 2006-11-01  Education
    10 10 2006-09-01 Unemployed
    20 10 2006-10-01         NA
    30 10 2006-11-01 Unemployed

I tried the solution proposed by Roland in "Calculate duration in R", but I am not sure how to adapt it to give me results by ID and to handle NAs.


    library(data.table)
    setDT(df)
    df[, date := as.POSIXct(date, format = "%Y-%m-%d", tz = "GMT")]

    glimpse(df)
    df$act <- ifelse(df$act == "Unemployed",1,-1)
    df[, run := cumsum(c(1, diff(act) != 0))]

    df1 <- df[, list(act = unique(act), 
                               duration = difftime(max(date), min(date), unit = "weeks")), 
                        by = run]
    df1
        run act duration
     1:   1   1  0 weeks
     2:   2  -1  0 weeks
     3:   3   1  0 weeks
     4:   4  -1  0 weeks
     5:   5   1  0 weeks
     6:   6  -1  0 weeks
     7:   7   1  0 weeks
     8:   8  -1  0 weeks
     9:   9   1  0 weeks
    10:  10  -1  0 weeks
    11:  11   1  0 weeks

What I am after is the following (duration here is in months, but it could be weeks or days):

        ID spell_count duration
    1    1           2        2
    2    2           1        2
    3    3           1        1
    ...
    10  10           1        2

Any help would be much appreciated, as would any links/literature/examples.

Thanks.

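Using the tidyverse package, you can group by one variable (or more) and summarise very easily.

Before summarising the data, I would coerce the column date to class Date and replace the string "NA" with actual missing values, NA.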

library(tidyverse)

is.na(df$act) <- df$act == "NA"
df$date <- as.Date(df$date)

df %>%
  group_by(ID, act) %>%
  summarise(spell_count = sum(act == "Unemployed", na.rm = TRUE),
            duration = difftime(last(date), first(date), units = "weeks")) %>%
  filter(act == "Unemployed") %>%
  select(-act)
## A tibble: 5 x 3
## Groups:   ID [5]
#     ID spell_count duration      
#  <int>       <int> <time>        
#1     1           2 8.714286 weeks
#2     2           2 4.428571 weeks
#3     3           1 0.000000 weeks
#4     4           2 4.428571 weeks
#5    10           2 8.714286 weeks

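The code above only returns rows for IDs that have at least one act == "Unemployed". Note that spell_count here counts the months recorded as Unemployed, and duration is the time between the first and last Unemployed dates. If you want all rows, the following base R solution will do it.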

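res <- lapply(split(df, df$ID), function(DF){
  # Months in which this person was unemployed (NA where act is missing)
  i <- DF$act == "Unemployed"
  if(any(i, na.rm = TRUE)) {
    # Time between the first and last unemployed dates
    duration <- difftime(max(DF$date[i], na.rm = TRUE), min(DF$date[i], na.rm = TRUE), units = "weeks")
  } else {
    # No unemployment at all: zero, kept as a difftime so every piece has the same column type
    duration <- as.difftime(0, units = "weeks")
  }
  spell_count <- sum(i, na.rm = TRUE)
  data.frame(ID = DF$ID[1], spell_count, duration)
})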

res <- do.call(rbind, res)
row.names(res) <- NULL
res
#   ID spell_count       duration
#1   1           2 8.714286 weeks
#2   2           2 4.428571 weeks
#3   3           1 0.000000 weeks
#4   4           2 4.428571 weeks
#5   5           0 0.000000 weeks
#6   6           0 0.000000 weeks
#7   7           0 0.000000 weeks
#8   8           0 0.000000 weeks
#9   9           0 0.000000 weeks
#10 10           2 8.714286 weeks
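
A side note on the two snippets above: spell_count there is the number of months with act == "Unemployed", so consecutive months are counted separately. As a rough sketch of how distinct spells could be counted instead, the base R function rle() can collapse consecutive identical values into runs. The helper name count_spells and the toy data frame toy are made up for this illustration, and the sketch assumes one row per person per month and that dropping NA months should not break a spell.

```r
# Toy data in the same layout as the question's example (IDs 1, 2 and 10 only)
toy <- data.frame(
  ID   = rep(c(1, 2, 10), each = 3),
  date = as.Date(rep(c("2006-09-01", "2006-10-01", "2006-11-01"), times = 3)),
  act  = c("Unemployed", "Education", "Unemployed",
           "Employment", "Unemployed", "Unemployed",
           "Unemployed", NA, "Unemployed")
)

# count_spells() is a hypothetical helper, not part of any package
count_spells <- function(act, state = "Unemployed") {
  act  <- act[!is.na(act)]     # drop NAs so they do not split a run
  runs <- rle(act)             # consecutive runs of identical values
  sum(runs$values == state)    # number of runs in the target state
}

# Runs only make sense in time order within each person
toy <- toy[order(toy$ID, toy$date), ]

spell_count <- sapply(split(toy$act, toy$ID), count_spells)
months      <- tapply(toy$act == "Unemployed", toy$ID, sum, na.rm = TRUE)

res2 <- data.frame(ID = as.integer(names(spell_count)),
                   spell_count = as.vector(spell_count),
                   months_unemployed = as.vector(months))
res2
```

On this toy data the result agrees with the question's desired output for IDs 1, 2 and 10: spell counts of 2, 1 and 1, with 2 unemployed months each.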

I would just use your first code block, and then for the total duration I do this:


    library(data.table)
    setDT(df)
    df_duration = df[act=="Unemployed",.(duration = .N),by = ID]

The number of unemployment spells is a bit trickier:


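    # Sort by person and month, then drop missing months so they do not break a spell
    df_spell_count <- df[order(ID, date)]
    df_spell_count <- df_spell_count[!(is.na(act) | act == "NA")]
    # Previous non-missing status within each person
    df_spell_count[, previous_act := shift(act, 1), by = ID]
    # A spell starts where the status is Unemployed and the previous one was not
    df_spell_count <- df_spell_count[act == "Unemployed" & (previous_act != "Unemployed" | is.na(previous_act))]
    df_spell_count <- df_spell_count[, .(spell_count = .N), by = ID]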

If you want to combine the two, just:

    df_stats <- merge(df_duration, df_spell_count, by = "ID", all.x = TRUE, all.y = TRUE)

Note that this df does not contain rows for people who have no unemployment spells.
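
If rows for people without any unemployment are needed as well, one possible approach is to merge the statistics onto the full list of IDs and replace the resulting NAs with zeros. The following is only a sketch on toy data; toy, toy_stats, all_ids and toy_full are illustrative names standing in for df and df_stats.

```r
library(data.table)

# Toy input: person 2 is never unemployed
toy <- data.table(
  ID  = c(1, 1, 2, 2, 3, 3),
  act = c("Unemployed", "Education", "Education", "Education",
          "Unemployed", "Unemployed")
)

# Stand-in for df_stats: only people with at least one unemployed month appear
toy_stats <- data.table(ID = c(1, 3),
                        duration = c(1L, 2L),
                        spell_count = c(1L, 1L))

# One row per person, whether or not they were ever unemployed
all_ids <- unique(toy[, .(ID)])

# Left join onto the full ID list, then turn the resulting NAs into zeros
toy_full <- merge(all_ids, toy_stats, by = "ID", all.x = TRUE)
toy_full[is.na(duration), duration := 0L]
toy_full[is.na(spell_count), spell_count := 0L]
toy_full
```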

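Here is another attempt using the tidyverse. Building data on "spells" is a common transformation of panel data; in a tidyverse approach, I think the trick is to generate a spell variable, like the "run" variable in the OP's original code.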

# libraries
library(tidyverse)
library(zoo)
library(lubridate)

# example dataset
ID <- c(1:10, 1:10, 1:10)
date <- c("2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", 
          "2006-09-01", "2006-09-01", "2006-09-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", 
          "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-11-01", 
          "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", 
          "2006-11-01", "2006-11-01")
act <- c("Unemployed", "Employment", "Education", "Education", "Education", "Education", "Education", 
         "Education", "Education", "Unemployed", "Education", "Unemployed", "Unemployed", "Unemployed", 
         "Education", "Education", "Employment", "Education", "Education", "NA", "Unemployed", 
         "Unemployed", "NA", "Unemployed", "Education", "Employment", "Employment", "NA", "Education", 
         "Unemployed")
df <- data.frame(ID, date, act)
df[order(ID),]

# convert types of some variables (in particular use zoo::yearmon instead of date, since these are actually yearmonth combos)
df$act <- as.character(df$act)
df$date <- lubridate::ymd(df$date)
df$yearmon <- zoo::as.yearmon(df$date)
df$act <- ifelse(df$act=='NA',NA,df$act)


# construct "act2", which is act, except when an NA is surrounded by the SAME act before and after, it is replaced with that same act
# e.g. Unemployed NA Unemployed -> Unemployed Unemployed Unemployed
# e.g. Education NA Unemployed -> stays the same
# (see note at the end of this discussion for more details on this)
df <- df %>% arrange(ID,date)

df <- df %>% group_by(ID) %>% mutate(
  act2 = ifelse(is.na(act) & (lag(act)==lead(act)), lead(act), act)
)

# create "spell" variable, which is like the "run" variable in the example code
# within ID this identifies the spell that is currently taking place
# --- this is the most important part of the code ---
df <- df %>% group_by(ID) %>% mutate(
  spell = cumsum(coalesce(is.na(act2) | act2!=lag(act2),FALSE)) + 1
)

# add yearmonth + 1 month, in order to do duration calculations
# (I'm again exploiting the fact that your data is monthly. if this were not true, this variable could be lead(date), within ID. but then we'd have to figure out how to deal with ends of the panel, where lead(date) is NA)
df$yearmonplusmonth <- df$yearmon + (1/12)

# construct a dataset of ID-spell combinations
spells <- df %>% group_by(ID,spell) %>% summarize(
  spelltype = first(act2),
  duration = (max(yearmonplusmonth) - min(yearmon))*12
)

# construct a dataset at the ID level, with desired summaries of spells
spellsummary <- spells %>% group_by(ID,spelltype) %>% summarize(
  spell_count = n(),
  duration = sum(duration)
) 

# if there are no spells of a given spelltype, it doesn't appear in spellsummary
# we need to fill out spellsummary with zeroes in ID-spelltype cases where there are no spells:
temp <- expand.grid(ID = unique(spellsummary$ID), spelltype = unique(spellsummary$spelltype))
spellsummary <- full_join(spellsummary,temp,by=c('ID','spelltype'))
# (across() replaces the deprecated mutate_at()/funs() combination; requires dplyr >= 1.0)
spellsummary <- spellsummary %>% mutate(across(c(spell_count, duration), ~ round(coalesce(as.numeric(.x), 0), 0)))
spellsummary <- spellsummary %>% arrange(ID,spelltype)

# finally, we just want Unemployed spelltype summaries by ID:
spellsummary %>% filter(spelltype=='Unemployed')

# A tibble: 10 x 4
# Groups:   ID [10]
# ID spelltype  spell_count duration
# <int> <chr>            <dbl>    <dbl>
# 1     1 Unemployed           2        2
# 2     2 Unemployed           1        2
# 3     3 Unemployed           1        1
# 4     4 Unemployed           1        2
# 5     5 Unemployed           0        0
# 6     6 Unemployed           0        0
# 7     7 Unemployed           0        0
# 8     8 Unemployed           0        0
# 9     9 Unemployed           0        0
# 10    10 Unemployed           1        3

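Note: I get 3 for the duration in the last row, rather than the 2 in the OP's desired output. The reason is that I assume Unemp NA Unemp is really Unemp Unemp Unemp, both for the purposes of spell_count and for duration. The OP wants this for spell_count but not for duration. One way to achieve that might be to use the "act" variable for the duration calculation and the "act2" variable for the spell_count calculation; I leave that to the reader.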
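
One way that exercise might be done, as a sketch rather than a definitive solution: count spells on the NA-bridged variable act2, but compute duration from the original act, so that an NA month inside a spell does not add to the duration. It assumes monthly data with exactly one row per person per month, so duration in months is just a count of rows; toy and toy_summary are names invented for this example.

```r
library(dplyr)

# Toy panel: IDs 1 and 10 from the question's example
toy <- tibble(
  ID   = rep(c(1, 10), each = 3),
  date = as.Date(rep(c("2006-09-01", "2006-10-01", "2006-11-01"), times = 2)),
  act  = c("Unemployed", "Education", "Unemployed",
           "Unemployed", NA, "Unemployed")
)

toy_summary <- toy %>%
  arrange(ID, date) %>%
  group_by(ID) %>%
  mutate(
    # act2 bridges an NA that sits between two identical states
    act2  = ifelse(is.na(act) & (lag(act) == lead(act)), lead(act), act),
    # a new spell starts when act2 is missing or differs from the previous month
    spell = cumsum(coalesce(is.na(act2) | act2 != lag(act2), FALSE)) + 1
  ) %>%
  summarise(
    # spells are counted on the bridged variable
    spell_count = n_distinct(spell[!is.na(act2) & act2 == "Unemployed"]),
    # duration counts only months actually observed as Unemployed
    duration = sum(act == "Unemployed", na.rm = TRUE)
  )

toy_summary
```

For ID 10 this gives spell_count 1 and duration 2, which is what the OP's desired output shows; ID 1 gives 2 and 2.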