Sum cumulative time between changes in a single status variable in R

I have been searching for an answer and reworking my code for hours. For a single ID, I have a dataset that looks like this:

# A tibble: 14 × 3
        ID state orderDate          
     <dbl> <chr> <dttm>             
 1 4227631 1     2022-03-14 19:00:00
 2 4227631 1     2022-03-14 20:00:00
 3 4227631 1     2022-03-15 11:00:00
 4 4227631 0     2022-03-15 11:00:00
 5 4227631 1     2022-03-15 20:00:00
 6 4227631 1     2022-03-16 04:00:00
 7 4227631 0     2022-03-16 04:00:00
 8 4227631 1     2022-03-16 05:00:00
 9 4227631 0     2022-03-16 13:00:00
10 4227631 1     2022-03-16 15:00:00

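The same pattern occurs for hundreds of IDs. For this example, I am using dplyr to group_by ID. I only care about changes in state from one row to the next, not about rows where the state stays the same.

I want to calculate the cumulative time each ID spends in state 1. Cases where state 1 is repeated several times before it changes should be disregarded. I have been planning to use lubridate and dplyr for the analysis.

The tibble I am using for this example: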

structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631, 
4227631, 4227631, 4227631, 4227631, 4227631), state = c("1", 
"1", "1", "0", "1", "1", "0", "1", "0", "1"), orderDate = structure(c(1647284400, 
1647288000, 1647342000, 1647342000, 1647374400, 1647403200, 1647403200, 
1647406800, 1647435600, 1647442800), tzone = "UTC", class = c("POSIXct", 
"POSIXt"))), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

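I have tried various solutions from other questions, but I ran into problems with lag and could not adapt them to this particular analysis.

The expected output might look like this:

I then plan to add all the statusOne values together to get the cumulative time spent in that state.

More elegant solutions are welcome, as are links to earlier questions that cover this.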


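EDIT: Using the solution below, I figured it out! That solution did not handle the case where state 0 immediately follows state 1, and we also want to count the time that elapses between those two rows.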

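library(tidyverse)  # dplyr for the verbs, tidyr for replace_na()

# Note: the column is called state in the example tibble above.
df %>%
  group_by(ID) %>%
  # max is a run index that increases each time state changes
  mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
  # hours1: gaps between consecutive rows inside a run of state 1
  mutate(hours1 = ifelse(max == lag(max) & state == "1",
                         difftime(orderDate, lag(orderDate), units = "hours"), NA)) %>%
  # hours2: gap from the last state 1 row to the state 0 row that ends the run
  mutate(hours2 = ifelse(state == "0" & lag(state) == "1",
                         difftime(orderDate, lag(orderDate), units = "hours"), NA)) %>%
  mutate(hours1 = replace_na(hours1, 0),
         hours2 = replace_na(hours2, 0)) %>%
  mutate(hours = hours1 + hours2) %>%
  select(-hours1, -hours2) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE)) %>%
  filter(total_hours != 0)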
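
As a cross-check on the logic in this edit, here is a minimal sketch that folds the two conditions into one: the gap ending at a row counts toward state 1 whenever the previous row was in state 1. The names gap_hours and from_one are introduced here for illustration only, and the sketch assumes the rows are already in time order within each ID.

```r
library(dplyr)

# Example tibble from the question, rebuilt from the epoch seconds in the dput
df <- tibble(
  ID = 4227631,
  state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
  orderDate = as.POSIXct(
    c(1647284400, 1647288000, 1647342000, 1647342000, 1647374400,
      1647403200, 1647403200, 1647406800, 1647435600, 1647442800),
    origin = "1970-01-01", tz = "UTC"
  )
)

result <- df %>%
  group_by(ID) %>%
  mutate(
    # hours elapsed since the previous row (NA on the first row of each ID)
    gap_hours = as.numeric(difftime(orderDate, lag(orderDate), units = "hours")),
    # TRUE when the previous row was in state 1 (assumed FALSE for the first row)
    from_one = lag(state, default = "0") == "1"
  ) %>%
  summarise(total_hours = sum(gap_hours[from_one], na.rm = TRUE))

result$total_hours
#> [1] 32
```

On the example tibble this gives 32 hours: 16 and 8 from the two repeated runs of state 1, plus 8 from the 05:00 to 13:00 stretch that ends in state 0.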

This is far from elegant, but at least it seems to give the right answer:

library(tidyverse)

df <- structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631, 
                            4227631, 4227631, 4227631, 4227631, 4227631),
                     state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
                     orderDate = structure(c(1647284400, 1647288000, 1647342000, 
                                             1647342000, 1647374400, 1647403200,
                                             1647403200, 1647406800, 1647435600, 
                                             1647442800), 
                                           tzone = "UTC",
                                           class = c("POSIXct", "POSIXt"))),
                row.names = c(NA, -10L),
                class = c("tbl_df", "tbl", "data.frame"))

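df2 <- df %>%
  group_by(ID) %>%
  # tmp is 1 on rows where state differs from the previous row;
  # max is the running count of those changes, i.e. a run index
  mutate(tmp = ifelse(state == lag(state, default = "1"), 0, 1),
         max = cumsum(tmp)) %>%
  # time since the previous row, kept only inside a run
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  select(-tmp)

# total hours per run; grouping by ID as well keeps runs from different IDs apart
df3 <- df2 %>%
  group_by(ID, max) %>%
  summarise(statusOne = sum(hours, na.rm = TRUE), .groups = "drop")

# attach the run total to every row, then zero it on rows that repeat the previous row's total
df4 <- left_join(df2, df3, by = c("ID", "max")) %>%
  distinct() %>%
  select(-c(max, hours)) %>%
  mutate(statusOne = ifelse(statusOne != 0 & lag(statusOne, default = 1) == statusOne, 0, statusOne))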

df4
#> # A tibble: 10 × 4
#> # Groups:   ID [1]
#>         ID state orderDate           statusOne
#>      <dbl> <chr> <dttm>                  <dbl>
#>  1 4227631 1     2022-03-14 19:00:00        16
#>  2 4227631 1     2022-03-14 20:00:00         0
#>  3 4227631 1     2022-03-15 11:00:00         0
#>  4 4227631 0     2022-03-15 11:00:00         0
#>  5 4227631 1     2022-03-15 20:00:00         8
#>  6 4227631 1     2022-03-16 04:00:00         0
#>  7 4227631 0     2022-03-16 04:00:00         0
#>  8 4227631 1     2022-03-16 05:00:00         0
#>  9 4227631 0     2022-03-16 13:00:00         0
#> 10 4227631 1     2022-03-16 15:00:00         0

Created on 2022-04-04 by the reprex package (v2.0.1)
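
The max column in the code above is a run index: it goes up by one every time state changes, so rows that share a value belong to the same unbroken run. A small base R sketch of that idea on the state vector from the question follows; changed, run_index and run_id are names chosen here, not part of the answer.

```r
state <- c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1")

# TRUE where a row differs from the one before it (the first row never counts)
changed <- c(FALSE, state[-1] != state[-length(state)])

# running count of changes: 0 for the first run, 1 for the second, and so on
run_index <- cumsum(changed)
run_index
#> [1] 0 0 0 1 2 2 3 4 5 6

# the same runs via run-length encoding, numbered from 1 instead of 0
r <- rle(state)
run_id <- rep(seq_along(r$lengths), r$lengths)
```

Time differences between consecutive rows that share a run index are the "state stayed the same" gaps that the answer sums.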

EDIT

Getting total_hours in state 1 for each ID is simpler:

df %>%
  group_by(ID) %>%
  mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE))
#> # A tibble: 1 × 2
#>        ID total_hours
#>     <dbl>       <dbl>
#> 1 4227631          24

Created on 2022-04-04 by the reprex package (v2.0.1)
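
One caveat with the short version: the max == lag(max) test is true inside any run, so a repeated state 0 would also be counted. The example tibble has no repeated 0, which is why 24 is unaffected there. The sketch below uses made-up data (toy is hypothetical, not from the question) to show the difference and one possible way to restrict the sum to state 1.

```r
library(dplyr)

# Hypothetical data: a single ID with a run of two rows in state 0
toy <- tibble(
  ID = 1,
  state = c("1", "1", "0", "0", "1"),
  orderDate = as.POSIXct("2022-01-01 00:00:00", tz = "UTC") + 3600 * c(0, 2, 3, 7, 8)
)

runs <- toy %>%
  group_by(ID) %>%
  mutate(
    max = cumsum(state != lag(state, default = "1")),
    # hours since the previous row, kept only inside a run
    hours = ifelse(max == lag(max),
                   as.numeric(difftime(orderDate, lag(orderDate), units = "hours")),
                   NA)
  )

# counts the 2 hour gap in state 1 and the 4 hour gap in state 0
all_states <- summarise(runs, total_hours = sum(hours, na.rm = TRUE))

# counts only gaps that end on a state 1 row
only_one <- summarise(runs, total_hours = sum(hours[state == "1"], na.rm = TRUE))

all_states$total_hours
#> [1] 6
only_one$total_hours
#> [1] 2
```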