R: min and max dates by ID and multiple status changes within ID

Question

I have an animal-tracking dataset that looks like this:

 Id         Start       Stop          Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out
 78122      5/15/1988   6/18/1988     In
 78122      6/19/1988   1/12/1989     In
 78122      1/13/1989   2/23/1990     In
 78122      2/24/1990   6/15/1991     Out
 78122      6/16/1991   2/11/1993     Out
 78122      2/12/1993   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/22/1994   1/25/1996     Out
 78122      1/26/1996   11/13/2001    In
 78122      11/14/2001  11/19/2001    In
 78122      11/20/2001  9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

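The animal was born in 1919 but moved in and out of its home territory several times. I would like to create a dataset that summarizes the min(Start) and max(Stop) dates for each consecutive run of the same status.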

For example, three rows show that the animal was inside the territory between 1/18/1966 and 1/23/1974.

 Id         Start       Stop          Status
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In

This information should be summarized into one row with min(Start) and max(Stop), like this:

 Id         MinStart    MaxStop       Status
 78122      1/18/1966   1/23/1974     In

Likewise, four rows show that the animal was outside the territory between 1/24/1974 and 5/14/1988.

 Id         Start       Stop          Status
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out

This information should be summarized into one row with min(Start) and max(Stop), like this:

 Id         MinStart    MaxStop       Status
 78122      1/24/1974   5/14/1988     Out

The other In and Out periods should be handled the same way. The final dataset should look like this:

 Id         MinStart    MaxStop       Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   1/23/1974     In
 78122      1/24/1974   5/14/1988     Out
 78122      5/15/1988   2/23/1990     In
 78122      2/24/1990   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/28/1994   1/25/1996     Out
 78122      1/26/1996   9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

Any suggestions on how to rearrange this dataset according to the criteria above would be very helpful. So far I have tried:

 test1 <- testcase %>% 
          group_by(ID,Status) %>% 
          summarize(MinStart  = min(Start), MaxStop= max(Stop))

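But this doesn't seem to work. It just creates a single min Start and max Stop date across all In rows and another across all Out rows, which is not correct.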

Answer 1

You need some run-length encoding. For convenience I'll use data.table::rleid, but you can use a base R version if you need to:

library(dplyr)
library(data.table)
testcase %>% 
  # rleid() gives each run of consecutive identical Status values its own id
  group_by(Id, RLE = rleid(Status)) %>%
  arrange(Start) %>%
  dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
# A tibble: 9 x 5
# Groups:   Id [1]
     Id   RLE Start      Stop       Status
  <int> <int> <date>     <date>     <chr> 
1 78122     1 1919-10-12 1919-10-12 Birth 
2 78122     2 1966-01-18 1974-01-23 In    
3 78122     3 1974-01-24 1988-05-14 Out   
4 78122     4 1988-05-15 1990-02-23 In    
5 78122     5 1990-02-24 1994-05-03 Out   
6 78122     6 1994-05-04 1994-07-27 In    
7 78122     7 1994-07-22 1996-01-25 Out   
8 78122     8 1996-01-26 2009-09-01 In    
9 78122     9 2009-09-26 2009-09-26 Death

Note that I converted your dates to class Date first; I'll leave that step to you. Without it they won't sort correctly.
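A minimal sketch of that conversion, assuming the dates are month/day/year strings as in the question. The variable names are only for illustration. It also shows why the conversion matters: compared as text, "10/23/1975" comes before "9/9/1972".

```r
# Two Start values from the question, still stored as text
start_chr <- c("9/9/1972", "10/23/1975")

# As text, the comparison goes character by character ("1" < "9"),
# so the later date is reported as the minimum
min(start_chr)

# Parse with an explicit format, then min()/max() compare real dates
start_date <- as.Date(start_chr, format = "%m/%d/%Y")
min(start_date)
```

In a dplyr pipeline the same as.Date() call could be applied to both Start and Stop inside mutate() before grouping.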

Here is the group_by call without data.table:

...
  group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...
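To see what that base R expression does, here is a small sketch on a made-up status vector (the vector is an assumption for illustration, not the question's data). The point is that the second run of "In" gets a different id from the first, which is exactly what grouping by Status alone cannot do.

```r
# Toy status sequence with two separate "In" runs
status <- c("Birth", "In", "In", "Out", "Out", "In", "Death")

# rle() stores the length of each run of identical consecutive values
runs <- rle(status)
runs$lengths   # 1 2 2 1 1

# Repeat each run's index by its length to get one run id per row
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id         # 1 2 2 3 3 4 5
```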

Answer 2

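One way to do this is to first parse the dates and coerce them to numeric with sapply, so that range can be used on them later. Then, inside ave, we use rle with mapply so that a variable x grows by 1 each time Status changes within an Id. Now we can easily aggregate the ranges over Id and x. A column subset of that result already gives us what we need; we only have to convert the numbers back with as.Date and cbind the Status, which is the suffix of x extracted with gsub.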

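# Parse the dates; sapply() drops the Date class, leaving day counts for range()
d[2:3] <- sapply(d[2:3], function(x) as.Date(x, "%m/%d/%Y"))
# Run id: increases by 1 each time the value changes
f <- function(x) {r <- rle(x)$lengths; unlist(mapply(rep, seq_along(r), r))}
d <- transform(d, x=paste(Id, ave(Status, Id, FUN=f), Status))
# range() yields min and max per group; keep Id, x, min(Start), max(Stop)
r <- do.call(data.frame, aggregate(cbind(Start, Stop) ~ Id + x, d, FUN=range))[c(1:3, 6)]
r[3:4] <- lapply(r[3:4], as.Date, origin="1970-01-01")
# The Status is whatever follows the last space in x
r <- cbind(r[1], setNames(r[3:4], c("MinStart", "MaxStop")), Status=gsub(".*\\s", "", r$x))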

Result

r[order(r$Id), ]
#       Id   MinStart    MaxStop Status
# 1  78122 1919-10-12 1919-10-12  Birth
# 2  78122 1966-01-18 1974-01-23     In
# 3  78122 1974-01-24 1988-05-14    Out
# 4  78122 1988-05-15 1990-02-23     In
# 5  78122 1990-02-24 1994-05-03    Out
# 6  78122 1994-05-04 1994-07-27     In
# 7  78122 1994-07-22 1996-01-25    Out
# 8  78122 1996-01-26 2009-09-01     In
# 9  78122 2009-09-26 2009-09-26  Death
# 10 78123 1919-10-12 1919-10-12  Birth
# 11 78123 1966-01-18 1974-01-23     In
# 12 78123 1974-01-24 1988-05-14    Out
# 13 78123 1988-05-15 1990-02-23     In
# 14 78123 1990-02-24 1994-05-03    Out
# 15 78123 1994-05-04 1994-07-27     In
# 16 78123 1994-07-22 1996-01-25    Out
# 17 78123 1996-01-26 2009-09-01     In
# 18 78123 2009-09-26 2009-09-26  Death
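The column subset c(1:3, 6) used above may look cryptic. The following sketch, on a made-up data frame (toy, g and the numbers are assumptions for illustration), shows what aggregate returns when FUN gives back two values and why the first and last of the flattened columns are the ones to keep.

```r
toy <- data.frame(
  g     = c("a", "a", "b"),
  Start = c(1, 5, 10),
  Stop  = c(4, 9, 12)
)

# FUN = range returns two values per group, so Start and Stop each
# come back as a two-column matrix stored inside the data frame
agg <- aggregate(cbind(Start, Stop) ~ g, toy, FUN = range)
dim(agg)      # 2 rows, 3 columns (two of them are matrices)

# do.call(data.frame, ...) spreads the matrices into ordinary columns:
# g, min(Start), max(Start), min(Stop), max(Stop)
flat <- do.call(data.frame, agg)
ncol(flat)    # 5

# Keep the group, min(Start) (column 2) and max(Stop) (column 5)
flat[c(1, 2, 5)]
```

In the answer there are two grouping columns (Id and x) instead of one, which shifts the positions to 3 and 6.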

Data:

Note: the data frame has been doubled, with Id incremented by one in the copy, for demonstration.

d <- structure(list(Id = c(78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L), Start = c("10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009", "10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009"), Stop = c("10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009", "10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009"), Status = c("Birth", 
"In", "In", "In", "Out", "Out", "Out", "Out", "In", "In", "In", 
"Out", "Out", "Out", "In", "Out", "In", "In", "In", "Death", 
"Birth", "In", "In", "In", "Out", "Out", "Out", "Out", "In", 
"In", "In", "Out", "Out", "Out", "In", "Out", "In", "In", "In", 
"Death")), class = "data.frame", row.names = c(NA, -40L))

Answer 3

For example, you can use insurancerating::reduce():

library(insurancerating)
library(dplyr)
library(lubridate)

d %>% 
  mutate(across(c(Start, Stop), lubridate::mdy)) %>%
  insurancerating::reduce(begin = Start, end = Stop, Id, Status)

#       Id Status index      Start       Stop
# 1  78122  Birth     0 1919-10-12 1919-10-12
# 2  78122  Death     0 2009-09-26 2009-09-26
# 3  78122     In     0 1966-01-18 1974-01-23
# 4  78122     In     1 1988-05-15 1990-02-23
# 5  78122     In     2 1994-05-04 1994-07-27
# 6  78122     In     3 1996-01-26 2009-09-01
# 7  78122    Out     0 1974-01-24 1988-05-14
# 8  78122    Out     1 1990-02-24 1994-05-03
# 9  78122    Out     2 1994-07-22 1996-01-25
# 10 78123  Birth     0 1919-10-12 1919-10-12
# 11 78123  Death     0 2009-09-26 2009-09-26
# 12 78123     In     0 1966-01-18 1974-01-23
# 13 78123     In     1 1988-05-15 1990-02-23
# 14 78123     In     2 1994-05-04 1994-07-27
# 15 78123     In     3 1996-01-26 2009-09-01
# 16 78123    Out     0 1974-01-24 1988-05-14
# 17 78123    Out     1 1990-02-24 1994-05-03
# 18 78123    Out     2 1994-07-22 1996-01-25

Note: d is the data provided by @jay.sf.
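Unlike the run-length approaches, this answer does not rely on row order: it groups by Id and Status and merges date ranges that touch or overlap. A rough base R sketch of that idea follows. It is not the insurancerating implementation; merge_spells and its max_gap argument are hypothetical names made up for this illustration.

```r
# Hypothetical helper: merge date intervals that overlap or are separated
# by at most `max_gap` days. Not the insurancerating implementation.
merge_spells <- function(start, stop, max_gap = 1) {
  o <- order(start)
  s <- as.numeric(start[o])
  e <- as.numeric(stop[o])
  # Latest end date seen among all earlier intervals
  prev_end <- c(-Inf, head(cummax(e), -1))
  # A new spell begins when the gap to every earlier interval exceeds max_gap
  spell <- cumsum(s - prev_end > max_gap)
  data.frame(
    Start = as.Date(as.numeric(tapply(s, spell, min)), origin = "1970-01-01"),
    Stop  = as.Date(as.numeric(tapply(e, spell, max)), origin = "1970-01-01")
  )
}

# The first four "In" rows of the question, parsed as dates
in_start <- as.Date(c("1/18/1966", "2/3/1972", "9/9/1972", "5/15/1988"),
                    format = "%m/%d/%Y")
in_stop  <- as.Date(c("2/2/1972", "9/8/1972", "1/23/1974", "6/18/1988"),
                    format = "%m/%d/%Y")
merged <- merge_spells(in_start, in_stop)
merged
```

The first three intervals are back to back and collapse into 1966-01-18 to 1974-01-23, while the 1988 interval stays separate. To cover the whole data set this would have to be applied per Id and Status, for example with split() and lapply().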
