Ignoring count reset at specified point in time series

Question

I have a data frame like this (edited; added the grouping variable measurement_type):

data <- data.frame(ID = as.factor(c(rep(1, 10),
                                    rep(2, 10))),
                   measurement_type = as.factor(c(rep("type_1", 5),
                                                  rep("type_2", 5),
                                                  rep("type_1", 5),
                                                  rep("type_2", 5))),
                   measurement_time = as.POSIXct(c("2014-06-17 04:00:00",
                                                   "2014-06-17 11:52:00",
                                                   "2014-06-17 18:58:00",
                                                   "2014-06-18 02:05:00",
                                                   "2014-06-18 08:00:00",
                                                   "2014-06-17 05:27:00",
                                                   "2014-06-17 11:10:00",
                                                   "2014-06-17 17:02:00",
                                                   "2014-06-17 23:56:00",
                                                   "2014-06-18 07:01:00",
                                                   "2014-07-03 16:01:00",
                                                   "2014-07-03 19:19:00",
                                                   "2014-07-03 23:55:00",
                                                   "2014-07-04 08:08:00",
                                                   "2014-07-04 13:55:00",
                                                   "2014-07-03 22:12:00",
                                                   "2014-07-04 08:59:00",
                                                   "2014-07-04 14:10:00",
                                                   "2014-07-04 17:00:00",
                                                   "2014-07-04 23:00:00")),
                   amount = c(350,470,310,470,650,
                              175,275,45,255,395,
                              130,460,540,790,69,
                              80,210,58,147,326),
                   entry_time = as.POSIXct(c(rep("2014-06-17 01:53:00", 10),
                                             rep("2014-07-03 14:35:00", 10))))

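The subjects with ID 1 and ID 2 enter at the given entry_time, after which cumulative amounts are measured at specific measurement_times. However, every day at noon the amount is reset to zero and counting starts over from zero. What I want to achieve is that, after each noon reset, the newly accumulating amount is added on top of the amount already accumulated before noon, separately for each level of the grouping variable measurement_type.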

Update

Thanks to @Istrel, I get almost the correct output using the provided answer:

data %>% as_tibble() %>%
  # Check 12 hours passed --> `pm` column
  mutate(pm = format(measurement_time, "%H") >= 12) %>%
  mutate(date_fct = format(measurement_time, "%Y_%d")) %>%
  # Group by ID and `pm`
  group_by(ID, measurement_type, date_fct, pm) %>%
  # Turn cumsum into actual values
  mutate(amount_act = amount - lag(amount, default = 0)) %>%
  # Cumsum over ID
  ungroup() %>%
  group_by(ID, measurement_type) %>%
  mutate(amount_cums = cumsum(amount_act)) %>%
  ungroup() %>%
  select(-c(pm, date_fct, amount_act))

Output

# A tibble: 20 x 6
   ID    measurement_type measurement_time    amount entry_time          amount_cums
   <fct> <fct>            <dttm>               <dbl> <dttm>                    <dbl>
 1 1     type_1           2014-06-17 04:00:00    350 2014-06-17 01:53:00         350
 2 1     type_1           2014-06-17 11:52:00    470 2014-06-17 01:53:00         470
 3 1     type_1           2014-06-17 18:58:00    310 2014-06-17 01:53:00         780
 4 1     type_1           2014-06-18 02:05:00    470 2014-06-17 01:53:00        1250
 5 1     type_1           2014-06-18 08:00:00    650 2014-06-17 01:53:00        1430
 6 1     type_2           2014-06-17 05:27:00    175 2014-06-17 01:53:00         175
 7 1     type_2           2014-06-17 11:10:00    275 2014-06-17 01:53:00         275
 8 1     type_2           2014-06-17 17:02:00     45 2014-06-17 01:53:00         320
 9 1     type_2           2014-06-17 23:56:00    255 2014-06-17 01:53:00         530
10 1     type_2           2014-06-18 07:01:00    395 2014-06-17 01:53:00         925
11 2     type_1           2014-07-03 16:01:00    130 2014-07-03 14:35:00         130
12 2     type_1           2014-07-03 19:19:00    460 2014-07-03 14:35:00         460
13 2     type_1           2014-07-03 23:55:00    540 2014-07-03 14:35:00         540
14 2     type_1           2014-07-04 08:08:00    790 2014-07-03 14:35:00        1330
15 2     type_1           2014-07-04 13:55:00     69 2014-07-03 14:35:00        1399
16 2     type_2           2014-07-03 22:12:00     80 2014-07-03 14:35:00          80
17 2     type_2           2014-07-04 08:59:00    210 2014-07-03 14:35:00         290
18 2     type_2           2014-07-04 14:10:00     58 2014-07-03 14:35:00         348
19 2     type_2           2014-07-04 17:00:00    147 2014-07-03 14:35:00         437
20 2     type_2           2014-07-04 23:00:00    326 2014-07-03 14:35:00         616

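As you can see, the cumulative sum is updated correctly across the noon reset. For overnight measurements, however, the code adds the full amount measured after midnight to the total from before midnight. There is no count reset at midnight, so only the increase since the last measurement should be added. For example, row 10 adds the value 395 to the amount_cums of 530 (row 9) and gets 925, whereas it should add only the difference from the previous reading (395 - 255 = 140), which makes the correct amount_cums for row 10 equal to 670.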
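To make the expected numbers concrete, here is a minimal base R sketch for rows 6 to 10 (ID 1, type_2). The `window` labels are assigned by hand purely for illustration (they are not a column in the data): 1 marks readings before the noon reset on 2014-06-17 and 2 marks readings after it, including the one taken after midnight.

```r
# Readings from rows 6-10 of the output (ID 1, type_2)
amount <- c(175, 275, 45, 255, 395)

# Hand-assigned labels (illustrative assumption): one label per stretch
# between two noon resets; midnight does not start a new stretch
window <- c(1, 1, 2, 2, 2)

# Increment within each window; the first reading of a window counts from zero
increment <- ave(amount, window, FUN = function(x) x - c(0, head(x, -1)))

# Running total that continues across the noon reset
expected_cums <- cumsum(increment)
expected_cums
# [1] 175 275 320 530 670
```

The last value, 670, is the amount_cums wanted for row 10.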

Any idea how to adjust your code?

Answer 1

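I can suggest this strategy. First, group the data by ID, measurement_type, date (year_month_day), and an AM/PM flag. Then convert the cumulative sums back into the raw per-measurement values within each group. Finally, regroup by ID and measurement_type and recompute the cumulative sum.

The solution might look like this: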

library(tidyverse)

dat_alt <- data %>% as_tibble() %>%
    # Flag measurements taken at or after noon --> `pm` column
    mutate(pm = as.integer(format(measurement_time, "%H")) >= 12) %>%
    # Date label including the month (year_month_day)
    mutate(date_fct = format(measurement_time, "%Y_%m_%d")) %>%
    # Group by ID, measurement type, date, and `pm`
    group_by(ID, measurement_type, date_fct, pm) %>%
    # Turn the cumulative readings into per-measurement increments
    mutate(amount_act = amount - lag(amount, default = 0)) %>%
    # Cumulative sum per ID and measurement type
    ungroup() %>%
    group_by(ID, measurement_type) %>%
    mutate(amount_cums = cumsum(amount_act)) %>%
    ungroup() %>%
    # Drop the helper columns
    select(-c(pm, date_fct, amount_act))
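The pipeline above starts a new group at midnight as well as at noon, which is the behaviour described in the question's update. One possible adjustment, offered as a sketch and not as part of the original answer, is to shift every timestamp back by 12 hours before taking the date, so that each noon-to-noon stretch gets a single label. The column name `noon_window` and the choice of `tz = "UTC"` are assumptions made here for reproducibility; a reading taken exactly at 12:00:00 is treated as belonging to the new window, and the fixed 12-hour shift ignores daylight saving changes.

```r
library(dplyr)

# Same values as the question's data frame, written compactly
# (entry_time is left out because the calculation does not use it).
# tz = "UTC" is an assumption so the result does not depend on the local zone.
data <- data.frame(
  ID = factor(rep(1:2, each = 10)),
  measurement_type = factor(rep(rep(c("type_1", "type_2"), each = 5), times = 2)),
  measurement_time = as.POSIXct(c(
    "2014-06-17 04:00:00", "2014-06-17 11:52:00", "2014-06-17 18:58:00",
    "2014-06-18 02:05:00", "2014-06-18 08:00:00", "2014-06-17 05:27:00",
    "2014-06-17 11:10:00", "2014-06-17 17:02:00", "2014-06-17 23:56:00",
    "2014-06-18 07:01:00", "2014-07-03 16:01:00", "2014-07-03 19:19:00",
    "2014-07-03 23:55:00", "2014-07-04 08:08:00", "2014-07-04 13:55:00",
    "2014-07-03 22:12:00", "2014-07-04 08:59:00", "2014-07-04 14:10:00",
    "2014-07-04 17:00:00", "2014-07-04 23:00:00"), tz = "UTC"),
  amount = c(350, 470, 310, 470, 650,
             175, 275, 45, 255, 395,
             130, 460, 540, 790, 69,
             80, 210, 58, 147, 326)
)

result <- data %>%
  as_tibble() %>%
  # Make sure readings are in time order within each series
  arrange(ID, measurement_type, measurement_time) %>%
  # Shift back 12 hours so that noon, not midnight, is the day boundary
  mutate(noon_window = format(measurement_time - 12 * 3600, "%Y_%m_%d")) %>%
  # Within one noon-to-noon window, recover the per-measurement increments;
  # the first reading after a reset counts from zero
  group_by(ID, measurement_type, noon_window) %>%
  mutate(amount_act = amount - lag(amount, default = 0)) %>%
  # Running total per ID and measurement type, continuing across resets
  group_by(ID, measurement_type) %>%
  mutate(amount_cums = cumsum(amount_act)) %>%
  ungroup() %>%
  select(-c(noon_window, amount_act))

result$amount_cums
```

With this grouping, row 10 (ID 1, type_2, 2014-06-18 07:01) falls in the same window as rows 8 and 9, so only the increase of 140 is added and amount_cums becomes 670, as requested in the question.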

r

time-series