如何每 15 分钟计算一次 "batch"
how to count a "batch" every 15 minutes
1 2021-01-01 12:59:38
2 2021-01-01 14:08:59
3 2021-01-01 14:09:08
4 2021-01-01 14:11:30
5 2021-01-01 14:22:19
6 2021-01-01 14:41:07
我希望能够每 15 分钟计算一次条目数,但要在滚动的基础上进行。例如 12:59 将在 15 分钟内出现 1 个,14:08 => 14:22 将在 15 分钟内出现,因此这批 return 4 个,最后 14:41将在另一个 15 分钟的批次中单独出现。
我希望这是有道理的,并提前致谢。
抱歉没有包括这个
> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707, 1609462979.593,
1609465088.437, 1609476270.88, 1609478479.177, 1609479667.373,
1609493081.887, 1609499187.29, 1609507506.37, 1609510989.533,
1609511522.023, 1609511894.067, 1609512194.773, 1609512377.227,
1609514474.153), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), batch_no = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 12L, 12L, 13L), batch_size = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
新编辑 - 感谢您为此付出努力。我收到一个错误
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')"
这看起来很奇怪,我的变量在 class
> class(df$ClickedDate)
[1] "POSIXct" "POSIXt"
这是否适用于 mutate,或者我需要转换它吗?
> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707, 1609462979.593,
1609465088.437, 1609476270.88, 1609478479.177, 1609479667.373,
1609493081.887, 1609499187.29, 1609507506.37, 1609510989.533,
1609511522.023, 1609511894.067, 1609512194.773, 1609512377.227,
1609514474.153), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), batch_no = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 12L, 12L, 13L), batch_size = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
提前致谢
使用 runner
包将有助于这种情况。使用以下策略
library(tidyverse)
library(runner)
df %>% mutate(b_len = runner::runner(x = ClickedDate,
idx = ClickedDate,
k = "15 mins",
lag = "-14 mins",
f = length),
b_no = purrr::accumulate(seq_len(length(b_len)-1), .init = b_len[1], ~ifelse(.x > .y, .x, .x + b_len[.x +1])),
b_no = as.integer(as.factor(b_no))) %>%
group_by(b_no) %>%
mutate(b_len = n())
# A tibble: 15 x 3
# Groups: b_no [12]
ClickedDate b_len b_no
<dttm> <int> <int>
1 2021-01-01 00:16:38 1 1
2 2021-01-01 01:02:59 1 2
3 2021-01-01 01:38:08 1 3
4 2021-01-01 04:44:30 1 4
5 2021-01-01 05:21:19 1 5
6 2021-01-01 05:41:07 1 6
7 2021-01-01 09:24:41 1 7
8 2021-01-01 11:06:27 1 8
9 2021-01-01 13:25:06 1 9
10 2021-01-01 14:23:09 2 10
11 2021-01-01 14:32:02 2 10
12 2021-01-01 14:38:14 3 11
13 2021-01-01 14:43:14 3 11
14 2021-01-01 14:46:17 3 11
15 2021-01-01 15:21:14 1 12
备注-
lag
runner
函数中的参数允许向后时间 window (滚动)所以我使用负滞后来使用向前时间 window.
k
runner
中的参数是给定的滚动长度 window
b_no
列最初标识 sliding/rolling window 最早 window 已用完,然后采用新的 window.
dense_rank
也可以使用(参见下面的替代方法)
或者
df %>% mutate(b_len = runner::runner(x = ClickedDate,
idx = ClickedDate,
k = "15 mins",
lag = "-14 mins",
f = length),
b_no = purrr::accumulate(seq_len(length(b_len)-1), .init = b_len[1], ~ifelse(.x > .y, .x, .x + b_len[.x +1])),
b_no = dense_rank(b_no)) %>%
group_by(b_no) %>%
mutate(b_len = n()) %>%
ungroup()
# A tibble: 15 x 3
ClickedDate b_len b_no
<dttm> <int> <int>
1 2021-01-01 00:16:38 1 1
2 2021-01-01 01:02:59 1 2
3 2021-01-01 01:38:08 1 3
4 2021-01-01 04:44:30 1 4
5 2021-01-01 05:21:19 1 5
6 2021-01-01 05:41:07 1 6
7 2021-01-01 09:24:41 1 7
8 2021-01-01 11:06:27 1 8
9 2021-01-01 13:25:06 1 9
10 2021-01-01 14:23:09 2 10
11 2021-01-01 14:32:02 2 10
12 2021-01-01 14:38:14 3 11
13 2021-01-01 14:43:14 3 11
14 2021-01-01 14:46:17 3 11
15 2021-01-01 15:21:14 1 12
使用的数据
df
> df
# A tibble: 15 x 1
ClickedDate
<dttm>
1 2021-01-01 00:16:38
2 2021-01-01 01:02:59
3 2021-01-01 01:38:08
4 2021-01-01 04:44:30
5 2021-01-01 05:21:19
6 2021-01-01 05:41:07
7 2021-01-01 09:24:41
8 2021-01-01 11:06:27
9 2021-01-01 13:25:06
10 2021-01-01 14:23:09
11 2021-01-01 14:32:02
12 2021-01-01 14:38:14
13 2021-01-01 14:43:14
14 2021-01-01 14:46:17
15 2021-01-01 15:21:14
1 2021-01-01 12:59:38
2 2021-01-01 14:08:59
3 2021-01-01 14:09:08
4 2021-01-01 14:11:30
5 2021-01-01 14:22:19
6 2021-01-01 14:41:07
我希望能够每 15 分钟计算一次条目数,但要在滚动的基础上进行。例如 12:59 将在 15 分钟内出现 1 个,14:08 => 14:22 将在 15 分钟内出现,因此这批 return 4 个,最后 14:41将在另一个 15 分钟的批次中单独出现。
我希望这是有道理的,并提前致谢。
抱歉没有包括这个
> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707, 1609462979.593,
1609465088.437, 1609476270.88, 1609478479.177, 1609479667.373,
1609493081.887, 1609499187.29, 1609507506.37, 1609510989.533,
1609511522.023, 1609511894.067, 1609512194.773, 1609512377.227,
1609514474.153), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), batch_no = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 12L, 12L, 13L), batch_size = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
新编辑 - 感谢您为此付出努力。我收到一个错误
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')"
这看起来很奇怪,我的变量在 class
> class(df$ClickedDate)
[1] "POSIXct" "POSIXt"
这是否适用于 mutate,或者我需要转换它吗?
> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707, 1609462979.593,
1609465088.437, 1609476270.88, 1609478479.177, 1609479667.373,
1609493081.887, 1609499187.29, 1609507506.37, 1609510989.533,
1609511522.023, 1609511894.067, 1609512194.773, 1609512377.227,
1609514474.153), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), batch_no = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 12L, 12L, 13L), batch_size = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L)), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
提前致谢
使用 runner
包将有助于这种情况。使用以下策略
library(tidyverse)
library(runner)
df %>% mutate(b_len = runner::runner(x = ClickedDate,
idx = ClickedDate,
k = "15 mins",
lag = "-14 mins",
f = length),
b_no = purrr::accumulate(seq_len(length(b_len)-1), .init = b_len[1], ~ifelse(.x > .y, .x, .x + b_len[.x +1])),
b_no = as.integer(as.factor(b_no))) %>%
group_by(b_no) %>%
mutate(b_len = n())
# A tibble: 15 x 3
# Groups: b_no [12]
ClickedDate b_len b_no
<dttm> <int> <int>
1 2021-01-01 00:16:38 1 1
2 2021-01-01 01:02:59 1 2
3 2021-01-01 01:38:08 1 3
4 2021-01-01 04:44:30 1 4
5 2021-01-01 05:21:19 1 5
6 2021-01-01 05:41:07 1 6
7 2021-01-01 09:24:41 1 7
8 2021-01-01 11:06:27 1 8
9 2021-01-01 13:25:06 1 9
10 2021-01-01 14:23:09 2 10
11 2021-01-01 14:32:02 2 10
12 2021-01-01 14:38:14 3 11
13 2021-01-01 14:43:14 3 11
14 2021-01-01 14:46:17 3 11
15 2021-01-01 15:21:14 1 12
备注-
lag
runner
函数中的参数允许向后时间 window (滚动)所以我使用负滞后来使用向前时间 window.k
runner
中的参数是给定的滚动长度 windowb_no
列最初标识 sliding/rolling window 最早 window 已用完,然后采用新的 window.dense_rank
也可以使用(参见下面的替代方法)
或者
df %>% mutate(b_len = runner::runner(x = ClickedDate,
idx = ClickedDate,
k = "15 mins",
lag = "-14 mins",
f = length),
b_no = purrr::accumulate(seq_len(length(b_len)-1), .init = b_len[1], ~ifelse(.x > .y, .x, .x + b_len[.x +1])),
b_no = dense_rank(b_no)) %>%
group_by(b_no) %>%
mutate(b_len = n()) %>%
ungroup()
# A tibble: 15 x 3
ClickedDate b_len b_no
<dttm> <int> <int>
1 2021-01-01 00:16:38 1 1
2 2021-01-01 01:02:59 1 2
3 2021-01-01 01:38:08 1 3
4 2021-01-01 04:44:30 1 4
5 2021-01-01 05:21:19 1 5
6 2021-01-01 05:41:07 1 6
7 2021-01-01 09:24:41 1 7
8 2021-01-01 11:06:27 1 8
9 2021-01-01 13:25:06 1 9
10 2021-01-01 14:23:09 2 10
11 2021-01-01 14:32:02 2 10
12 2021-01-01 14:38:14 3 11
13 2021-01-01 14:43:14 3 11
14 2021-01-01 14:46:17 3 11
15 2021-01-01 15:21:14 1 12
使用的数据
df
> df
# A tibble: 15 x 1
ClickedDate
<dttm>
1 2021-01-01 00:16:38
2 2021-01-01 01:02:59
3 2021-01-01 01:38:08
4 2021-01-01 04:44:30
5 2021-01-01 05:21:19
6 2021-01-01 05:41:07
7 2021-01-01 09:24:41
8 2021-01-01 11:06:27
9 2021-01-01 13:25:06
10 2021-01-01 14:23:09
11 2021-01-01 14:32:02
12 2021-01-01 14:38:14
13 2021-01-01 14:43:14
14 2021-01-01 14:46:17
15 2021-01-01 15:21:14