时间数据:宽到长

Time data: Wide to Long

我不确定我是否正确使用了术语 wide/long,但我正在尝试重新格式化轮班数据,以便我可以看到每小时增加了多少劳动力。

假设我有以下数据集:

library(data.table)
sample_DT <- data.table(
  store = c("A", "A", "A", "A", "B", "B"),
  date = ymd(c("2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24")),
  start_hr = c(23,0,2,7,4,2),
  duration_hr = c(8,4,4,12,6,10)
  )

看起来像:

     store     date    start_hr duration_hr
   <char>     <Date>    <num>       <num>
1:      A 2019-03-24       23           8
2:      A 2019-03-24        0           4
3:      A 2019-03-24        2           4
4:      A 2019-03-24        7          12
5:      B 2019-03-24        4           6
6:      B 2019-03-24        2          10

我试图查看每个日期的每个小时间隔(0-1、1-2、2-3、3-4,...)期间,商店使用了多少劳动力。所以数据应该是这样的:


store    date    time_hr usage
A   2019-03-24    0       1
A   2019-03-24    1       1
A   2019-03-24    2       2
A   2019-03-24    3       2
A   2019-03-24    4       1
A   2019-03-24    5       1 
B   ...
B   ...

在上面,time_hr表示时间间隔(例如,time_hr = 0表示从午夜到凌晨1点的间隔)。请注意,有时班次可以 运行 多天(例如从 23:00 开始,持续 8 小时)。

谢谢!

这可以通过几种方式完成,这是一种使用旋转数据的方法,这样每个班次开始都变成 +1,每个班次结束都变成 -1

library(dplyr); library(tidyr); library(lubridate)
sample_df %>%
  transmute(store, 
            start_hr = ymd_h(paste(date, start_hr)),
            end_hr = start_hr + dhours(duration_hr)) %>%
  pivot_longer(-store, 
               names_to = "change",
               values_to = "timestamp") %>%
  mutate(change = if_else(change == "start_hr", 1, -1)) %>%
  count(store, timestamp, wt = change, name = "change") %>%  # total change/hr
  arrange(store, timestamp) %>%
  complete(store, 
           timestamp = seq.POSIXt(min(timestamp), max(timestamp), 
                                  by = "hour"), 
           fill = list(change = 0)) %>%
  mutate(usage = cumsum(change))

结果

# A tibble: 64 × 4
   store timestamp           change usage
   <chr> <dttm>               <dbl> <dbl>
 1 A     2019-03-24 00:00:00      1     1
 2 A     2019-03-24 01:00:00      0     1
 3 A     2019-03-24 02:00:00      1     2
 4 A     2019-03-24 03:00:00      0     2
 5 A     2019-03-24 04:00:00     -1     1
 6 A     2019-03-24 05:00:00      0     1
 7 A     2019-03-24 06:00:00     -1     0
 8 A     2019-03-24 07:00:00      1     1
 9 A     2019-03-24 08:00:00      0     1
10 A     2019-03-24 09:00:00      0     1
# … with 54 more rows

这是一种使用 data.table 并拆分出一个小函数 f 的方法,该函数构成一系列 hr 间隔。不同的间隔然后由商店求和。

f <- function(d,h, dur) seq(ymd_h(paste(d,h)), by="hour",length.out=dur)

sample_DT[,id:=.I] %>% 
  .[, .(hr_interval = f(date,start_hr, duration_hr)), by=.(store, id)] %>% 
  .[, .(usage = .N), .(store, hr_interval)] %>% 
  .[order(store,hr_interval)]

输出:

    store         hr_interval usage
 1:     A 2019-03-24 00:00:00     1
 2:     A 2019-03-24 01:00:00     1
 3:     A 2019-03-24 02:00:00     2
 4:     A 2019-03-24 03:00:00     2
 5:     A 2019-03-24 04:00:00     1
 6:     A 2019-03-24 05:00:00     1
 7:     A 2019-03-24 07:00:00     1
 8:     A 2019-03-24 08:00:00     1
 9:     A 2019-03-24 09:00:00     1
10:     A 2019-03-24 10:00:00     1
11:     A 2019-03-24 11:00:00     1
12:     A 2019-03-24 12:00:00     1
13:     A 2019-03-24 13:00:00     1
14:     A 2019-03-24 14:00:00     1
15:     A 2019-03-24 15:00:00     1
16:     A 2019-03-24 16:00:00     1
17:     A 2019-03-24 17:00:00     1
18:     A 2019-03-24 18:00:00     1
19:     A 2019-03-24 23:00:00     1
20:     A 2019-03-25 00:00:00     1
21:     A 2019-03-25 01:00:00     1
22:     A 2019-03-25 02:00:00     1
23:     A 2019-03-25 03:00:00     1
24:     A 2019-03-25 04:00:00     1
25:     A 2019-03-25 05:00:00     1
26:     A 2019-03-25 06:00:00     1
27:     B 2019-03-24 02:00:00     1
28:     B 2019-03-24 03:00:00     1
29:     B 2019-03-24 04:00:00     2
30:     B 2019-03-24 05:00:00     2
31:     B 2019-03-24 06:00:00     2
32:     B 2019-03-24 07:00:00     2
33:     B 2019-03-24 08:00:00     2
34:     B 2019-03-24 09:00:00     2
35:     B 2019-03-24 10:00:00     1
36:     B 2019-03-24 11:00:00     1
    store         hr_interval usage

为了完整起见,我想提一下 IRanges 包中的 coverage() 函数,它完成了所有繁重的工作。它在允许结果的紧凑 RLE(运行 长度编码)表示的间隔上工作,并将其转换为 OP 要求的网格显示。

唯一的缺点是它需要 整数 间隔。因此,我们必须来回强制将 date-time 间隔转换为整数间隔。

library(data.table)
library(lubridate)
library(magrittr)
if (!"IRanges" %in% rownames(installed.packages())) {
  install.packages("IRanges", 
                   repos = "https://bioconductor.org/packages/3.15/bioc")
}
library(IRanges)

origin <- min(d$date)
usage <- d[, IRanges(start = as.integer(date - origin) * 24L + start_hr, 
                     width = duration_hr, 
                     names = store)] %>% 
  split(names(.)) %>% 
  coverage() 

usage
RleList of length 2
$A
integer-Rle of length 30 with 7 runs
  Lengths:  1  2  2  1 12  4  8
  Values :  1  2  1  0  1  0  1

$B
integer-Rle of length 11 with 4 runs
  Lengths: 1 2 6 2
  Values : 0 1 2 1

对于每个商店,这是使用情况(或覆盖范围)的 运行 长度编码表示。对于商店 A,例如,第一个小时的使用量为 1,接下来的 2 小时内的使用量为 2,依此类推。

这可以强制转换为更易于理解的使用间隔表示:

lapply(usage, function(x) data.table(
  start_dt = origin + hours(start(x) - 1L),
  duration_hr = width(x),
  usage = runValue(x))) %>% 
  rbindlist(idcol = "shop")
      shop            start_dt duration_hr usage
 1:      A 2019-03-24 00:00:00           1     1
 2:      A 2019-03-24 01:00:00           2     2
 3:      A 2019-03-24 03:00:00           2     1
 4:      A 2019-03-24 05:00:00           1     0
 5:      A 2019-03-24 06:00:00          12     1
 6:      A 2019-03-24 18:00:00           4     0
 7:      A 2019-03-24 22:00:00           8     1
 8:      B 2019-03-24 00:00:00           1     0
 9:      B 2019-03-24 01:00:00           2     1
10:      B 2019-03-24 03:00:00           6     2
11:      B 2019-03-24 09:00:00           2     1

或者,可以按照 OP 的要求将使用情况显示为一系列每小时间隔:

lapply(usage, function(x) data.table(
  usage = decode(x))[, timestamp := origin + hours(seq_along(usage) - 1L)]) %>% 
  rbindlist(idcol = "shop")
      shop usage           timestamp
 1:      A     1 2019-03-24 00:00:00
 2:      A     2 2019-03-24 01:00:00
 3:      A     2 2019-03-24 02:00:00
 4:      A     1 2019-03-24 03:00:00
 5:      A     1 2019-03-24 04:00:00
 6:      A     0 2019-03-24 05:00:00
 7:      A     1 2019-03-24 06:00:00
 8:      A     1 2019-03-24 07:00:00
 9:      A     1 2019-03-24 08:00:00
10:      A     1 2019-03-24 09:00:00
11:      A     1 2019-03-24 10:00:00
12:      A     1 2019-03-24 11:00:00
13:      A     1 2019-03-24 12:00:00
14:      A     1 2019-03-24 13:00:00
15:      A     1 2019-03-24 14:00:00
16:      A     1 2019-03-24 15:00:00
17:      A     1 2019-03-24 16:00:00
18:      A     1 2019-03-24 17:00:00
19:      A     0 2019-03-24 18:00:00
20:      A     0 2019-03-24 19:00:00
21:      A     0 2019-03-24 20:00:00
22:      A     0 2019-03-24 21:00:00
23:      A     1 2019-03-24 22:00:00
24:      A     1 2019-03-24 23:00:00
25:      A     1 2019-03-25 00:00:00
26:      A     1 2019-03-25 01:00:00
27:      A     1 2019-03-25 02:00:00
28:      A     1 2019-03-25 03:00:00
29:      A     1 2019-03-25 04:00:00
30:      A     1 2019-03-25 05:00:00
31:      B     0 2019-03-24 00:00:00
32:      B     1 2019-03-24 01:00:00
33:      B     1 2019-03-24 02:00:00
34:      B     2 2019-03-24 03:00:00
35:      B     2 2019-03-24 04:00:00
36:      B     2 2019-03-24 05:00:00
37:      B     2 2019-03-24 06:00:00
38:      B     2 2019-03-24 07:00:00
39:      B     2 2019-03-24 08:00:00
40:      B     1 2019-03-24 09:00:00
41:      B     1 2019-03-24 10:00:00
      shop usage           timestamp

数据

library(data.table)
library(lubridate)
d <- data.table(
  store = c("A", "A", "A", "A", "B", "B"),
  date = ymd(rep("2019-03-24", 6L)),
  start_hr = c(23, 0, 2, 7, 4, 2),
  duration_hr = c(8, 4, 4, 12, 6, 10)
)