Ignoring count reset at specified point in time series
I have a dataframe like this (edited; added the grouping variable measurement_type):
data <- data.frame(ID = as.factor(c(rep(1, 10),
rep(2, 10))),
measurement_type = as.factor(c(rep("type_1", 5),
rep("type_2", 5),
rep("type_1", 5),
rep("type_2", 5))),
measurement_time = as.POSIXct(c("2014-06-17 04:00:00",
"2014-06-17 11:52:00",
"2014-06-17 18:58:00",
"2014-06-18 02:05:00",
"2014-06-18 08:00:00",
"2014-06-17 05:27:00",
"2014-06-17 11:10:00",
"2014-06-17 17:02:00",
"2014-06-17 23:56:00",
"2014-06-18 07:01:00",
"2014-07-03 16:01:00",
"2014-07-03 19:19:00",
"2014-07-03 23:55:00",
"2014-07-04 08:08:00",
"2014-07-04 13:55:00",
"2014-07-03 22:12:00",
"2014-07-04 08:59:00",
"2014-07-04 14:10:00",
"2014-07-04 17:00:00",
"2014-07-04 23:00:00")),
amount = c(350,470,310,470,650,
175,275,45,255,395,
130,460,540,790,69,
80,210,58,147,326),
entry_time = as.POSIXct(c(rep("2014-06-17 01:53:00", 10),
rep("2014-07-03 14:35:00", 10))))
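Subjects with ID 1 and ID 2 enter at the specified entry_time. After that, cumulative amounts are measured at specific measurement_times. However, every day at noon the amount is set back to zero and counting starts again from zero. What I want to achieve is that, after each noon reset, the newly accumulating amount keeps being added to the amount that had already accumulated before noon (grouped by the grouping variable measurement_type).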
Update
Thanks to @Istrel, I almost get the correct output using the provided answer:
data %>% as_tibble() %>%
# Check 12 hours passed --> `pm` column
mutate(pm = format(measurement_time, "%H") >= 12) %>%
mutate(date_fct = format(measurement_time, "%Y_%d")) %>%
# Group by ID and `pm`
group_by(ID, measurement_type, date_fct, pm) %>%
# Turn cumsum into actual values
mutate(amount_act = amount - lag(amount, default = 0)) %>%
# Cumsum over ID
ungroup() %>%
group_by(ID, measurement_type) %>%
mutate(amount_cums = cumsum(amount_act)) %>%
ungroup() %>%
select(-c(pm, date_fct, amount_act))
Output
# A tibble: 20 x 6
ID measurement_type measurement_time amount entry_time amount_cums
<fct> <fct> <dttm> <dbl> <dttm> <dbl>
1 1 type_1 2014-06-17 04:00:00 350 2014-06-17 01:53:00 350
2 1 type_1 2014-06-17 11:52:00 470 2014-06-17 01:53:00 470
3 1 type_1 2014-06-17 18:58:00 310 2014-06-17 01:53:00 780
4 1 type_1 2014-06-18 02:05:00 470 2014-06-17 01:53:00 1250
5 1 type_1 2014-06-18 08:00:00 650 2014-06-17 01:53:00 1430
6 1 type_2 2014-06-17 05:27:00 175 2014-06-17 01:53:00 175
7 1 type_2 2014-06-17 11:10:00 275 2014-06-17 01:53:00 275
8 1 type_2 2014-06-17 17:02:00 45 2014-06-17 01:53:00 320
9 1 type_2 2014-06-17 23:56:00 255 2014-06-17 01:53:00 530
10 1 type_2 2014-06-18 07:01:00 395 2014-06-17 01:53:00 925
11 2 type_1 2014-07-03 16:01:00 130 2014-07-03 14:35:00 130
12 2 type_1 2014-07-03 19:19:00 460 2014-07-03 14:35:00 460
13 2 type_1 2014-07-03 23:55:00 540 2014-07-03 14:35:00 540
14 2 type_1 2014-07-04 08:08:00 790 2014-07-03 14:35:00 1330
15 2 type_1 2014-07-04 13:55:00 69 2014-07-03 14:35:00 1399
16 2 type_2 2014-07-03 22:12:00 80 2014-07-03 14:35:00 80
17 2 type_2 2014-07-04 08:59:00 210 2014-07-03 14:35:00 290
18 2 type_2 2014-07-04 14:10:00 58 2014-07-03 14:35:00 348
19 2 type_2 2014-07-04 17:00:00 147 2014-07-03 14:35:00 437
20 2 type_2 2014-07-04 23:00:00 326 2014-07-03 14:35:00 616
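As you can see, the cumulative sum is updated correctly across the noon reset. However, when the measurements run overnight, the code adds the full post-midnight reading to the pre-midnight total. There is no count reset at midnight, so the amount should simply keep accumulating from the pre-midnight value.
In the output shown above, though, the whole reading is added to the value before midnight. In row 10, for example, 395 is added to the amount_cums of 530 (row 9), whereas only the difference to the previous reading should be added (395 - 255 = 140), so the correct amount_cums for row 10 is 670.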
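To make the expected numbers concrete, here is a small worked check for the ID 1 / type_2 rows only. It is a sketch of the arithmetic, not a general solution: the `reset_before` flags are filled in by hand from the timestamps (only the 17:02 reading follows a noon reset), and all variable names are just for illustration.

```r
# Raw counter readings for ID 1 / type_2, in time order (rows 6 to 10)
amount <- c(175, 275, 45, 255, 395)

# TRUE where a noon reset happened since the previous reading
# (assumption: set by hand from the timestamps, only 17:02 follows a reset)
reset_before <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

# Previous reading, with 0 before the first one
previous <- c(0, head(amount, -1))

# After a reset the reading itself is the increment,
# otherwise the increment is the difference to the previous reading
increment <- ifelse(reset_before, amount, amount - previous)

expected <- cumsum(increment)
print(expected)
# 175 275 320 530 670
```

The last value is the 670 mentioned above: midnight between 23:56 and 07:01 is not a reset, so only 140 is added to 530.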
Any idea how to adjust your code?
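I can suggest this strategy. First, group the data by ID, date (year_month_day) and an AM/PM label. Then convert the cumulative readings back into per-measurement values within each group. Finally, regroup (by ID and measurement_type in the code below) and recompute the cumulative sum.
The solution could look like this: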
library(tidyverse)
dat_alt <- data %>% as_tibble() %>%
  # Flag measurements taken at or after 12:00 --> `pm` column
  mutate(pm = as.integer(format(measurement_time, "%H")) >= 12) %>%
  # Calendar-day label (year_month_day)
  mutate(date_fct = format(measurement_time, "%Y_%m_%d")) %>%
  # Group by ID, measurement type, day and `pm`
  group_by(ID, measurement_type, date_fct, pm) %>%
  # Turn the cumulative readings into per-measurement increments within each group
  mutate(amount_act = amount - lag(amount, default = 0)) %>%
  # Cumulative sum over ID and measurement type
  ungroup() %>%
  group_by(ID, measurement_type) %>%
  mutate(amount_cums = cumsum(amount_act)) %>%
  ungroup() %>%
  # Drop the helper columns
  select(-c(pm, date_fct, amount_act))
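The update in the question points out that grouping on the calendar date also restarts the differencing at midnight, where no reset happens. One possible adjustment, offered as a sketch under stated assumptions rather than a definitive fix: treat each counting window as running from noon to noon by shifting the timestamps back 12 hours and taking the date of the shifted time. `reset_period` is a hypothetical helper column introduced here, and the time zone is pinned to UTC (an assumption, the original data uses the local time zone) so the example is reproducible.

```r
library(dplyr)

# Same values as the example data; tz = "UTC" is an assumption for reproducibility
data <- data.frame(
  ID = factor(rep(1:2, each = 10)),
  measurement_type = factor(rep(rep(c("type_1", "type_2"), each = 5), times = 2)),
  measurement_time = as.POSIXct(c(
    "2014-06-17 04:00:00", "2014-06-17 11:52:00", "2014-06-17 18:58:00",
    "2014-06-18 02:05:00", "2014-06-18 08:00:00",
    "2014-06-17 05:27:00", "2014-06-17 11:10:00", "2014-06-17 17:02:00",
    "2014-06-17 23:56:00", "2014-06-18 07:01:00",
    "2014-07-03 16:01:00", "2014-07-03 19:19:00", "2014-07-03 23:55:00",
    "2014-07-04 08:08:00", "2014-07-04 13:55:00",
    "2014-07-03 22:12:00", "2014-07-04 08:59:00", "2014-07-04 14:10:00",
    "2014-07-04 17:00:00", "2014-07-04 23:00:00"), tz = "UTC"),
  amount = c(350, 470, 310, 470, 650,
             175, 275, 45, 255, 395,
             130, 460, 540, 790, 69,
             80, 210, 58, 147, 326),
  entry_time = as.POSIXct(rep(c("2014-06-17 01:53:00", "2014-07-03 14:35:00"),
                              each = 10), tz = "UTC")
)

result <- data %>%
  as_tibble() %>%
  # Hypothetical helper: shifting back 12 hours gives every reading between
  # one noon and the next the same date, so midnight no longer splits a window
  mutate(reset_period = as.Date(measurement_time - 12 * 3600, tz = "UTC")) %>%
  # Make sure readings are in time order within each series
  arrange(ID, measurement_type, measurement_time) %>%
  # Within one noon-to-noon window, convert counter readings to increments
  group_by(ID, measurement_type, reset_period) %>%
  mutate(amount_act = amount - dplyr::lag(amount, default = 0)) %>%
  # Accumulate the increments over the whole series
  group_by(ID, measurement_type) %>%
  mutate(amount_cums = cumsum(amount_act)) %>%
  ungroup() %>%
  select(-reset_period, -amount_act)

print(result, n = 20)
```

With the example data, row 10 comes out as 670 (530 + 140) instead of 925, and row 4 as 940 instead of 1250. A reading taken at exactly 12:00:00 is counted as the first reading after the reset, which may or may not match how the counter really behaves.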