时间数据:宽到长
Time data: Wide to Long
我不确定我是否正确使用了术语 wide/long,但我正在尝试重新格式化轮班数据,以便我可以看到每小时增加了多少劳动力。
假设我有以下数据集:
library(data.table)
sample_DT <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(c("2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24")),
start_hr = c(23,0,2,7,4,2),
duration_hr = c(8,4,4,12,6,10)
)
看起来像:
store date start_hr duration_hr
<char> <Date> <num> <num>
1: A 2019-03-24 23 8
2: A 2019-03-24 0 4
3: A 2019-03-24 2 4
4: A 2019-03-24 7 12
5: B 2019-03-24 4 6
6: B 2019-03-24 2 10
我试图查看每个日期的每个小时间隔(0-1、1-2、2-3、3-4,...)期间,商店使用了多少劳动力。所以数据应该是这样的:
store date time_hr usage
A 2019-03-24 0 1
A 2019-03-24 1 1
A 2019-03-24 2 2
A 2019-03-24 3 2
A 2019-03-24 4 1
A 2019-03-24 5 1
B ...
B ...
在上面,time_hr
表示时间间隔(例如,time_hr = 0
表示从午夜到凌晨1点的间隔)。请注意,有时班次可以 运行 多天(例如从 23:00 开始,持续 8 小时)。
谢谢!
这可以通过几种方式完成,这是一种使用旋转数据的方法,这样每个班次开始都变成 +1
,每个班次结束都变成 -1
。
library(dplyr); library(tidyr); library(lubridate)
sample_df %>%
transmute(store,
start_hr = ymd_h(paste(date, start_hr)),
end_hr = start_hr + dhours(duration_hr)) %>%
pivot_longer(-store,
names_to = "change",
values_to = "timestamp") %>%
mutate(change = if_else(change == "start_hr", 1, -1)) %>%
count(store, timestamp, wt = change, name = "change") %>% # total change/hr
arrange(store, timestamp) %>%
complete(store,
timestamp = seq.POSIXt(min(timestamp), max(timestamp),
by = "hour"),
fill = list(change = 0)) %>%
mutate(usage = cumsum(change))
结果
# A tibble: 64 × 4
store timestamp change usage
<chr> <dttm> <dbl> <dbl>
1 A 2019-03-24 00:00:00 1 1
2 A 2019-03-24 01:00:00 0 1
3 A 2019-03-24 02:00:00 1 2
4 A 2019-03-24 03:00:00 0 2
5 A 2019-03-24 04:00:00 -1 1
6 A 2019-03-24 05:00:00 0 1
7 A 2019-03-24 06:00:00 -1 0
8 A 2019-03-24 07:00:00 1 1
9 A 2019-03-24 08:00:00 0 1
10 A 2019-03-24 09:00:00 0 1
# … with 54 more rows
这是一种使用 data.table 并拆分出一个小函数 f
的方法,该函数构成一系列 hr 间隔。不同的间隔然后由商店求和。
f <- function(d,h, dur) seq(ymd_h(paste(d,h)), by="hour",length.out=dur)
sample_DT[,id:=.I] %>%
.[, .(hr_interval = f(date,start_hr, duration_hr)), by=.(store, id)] %>%
.[, .(usage = .N), .(store, hr_interval)] %>%
.[order(store,hr_interval)]
输出:
store hr_interval usage
1: A 2019-03-24 00:00:00 1
2: A 2019-03-24 01:00:00 1
3: A 2019-03-24 02:00:00 2
4: A 2019-03-24 03:00:00 2
5: A 2019-03-24 04:00:00 1
6: A 2019-03-24 05:00:00 1
7: A 2019-03-24 07:00:00 1
8: A 2019-03-24 08:00:00 1
9: A 2019-03-24 09:00:00 1
10: A 2019-03-24 10:00:00 1
11: A 2019-03-24 11:00:00 1
12: A 2019-03-24 12:00:00 1
13: A 2019-03-24 13:00:00 1
14: A 2019-03-24 14:00:00 1
15: A 2019-03-24 15:00:00 1
16: A 2019-03-24 16:00:00 1
17: A 2019-03-24 17:00:00 1
18: A 2019-03-24 18:00:00 1
19: A 2019-03-24 23:00:00 1
20: A 2019-03-25 00:00:00 1
21: A 2019-03-25 01:00:00 1
22: A 2019-03-25 02:00:00 1
23: A 2019-03-25 03:00:00 1
24: A 2019-03-25 04:00:00 1
25: A 2019-03-25 05:00:00 1
26: A 2019-03-25 06:00:00 1
27: B 2019-03-24 02:00:00 1
28: B 2019-03-24 03:00:00 1
29: B 2019-03-24 04:00:00 2
30: B 2019-03-24 05:00:00 2
31: B 2019-03-24 06:00:00 2
32: B 2019-03-24 07:00:00 2
33: B 2019-03-24 08:00:00 2
34: B 2019-03-24 09:00:00 2
35: B 2019-03-24 10:00:00 1
36: B 2019-03-24 11:00:00 1
store hr_interval usage
为了完整起见,我想提一下 IRanges
包中的 coverage()
函数,它完成了所有繁重的工作。它在允许结果的紧凑 RLE(运行 长度编码)表示的间隔上工作,并将其转换为 OP 要求的网格显示。
唯一的缺点是它需要 整数 间隔。因此,我们必须来回强制将 date-time 间隔转换为整数间隔。
library(data.table)
library(lubridate)
library(magrittr)
if (!"IRanges" %in% rownames(installed.packages())) {
install.packages("IRanges",
repos = "https://bioconductor.org/packages/3.15/bioc")
}
library(IRanges)
origin <- min(d$date)
usage <- d[, IRanges(start = as.integer(date - origin) * 24L + start_hr,
width = duration_hr,
names = store)] %>%
split(names(.)) %>%
coverage()
usage
RleList of length 2
$A
integer-Rle of length 30 with 7 runs
Lengths: 1 2 2 1 12 4 8
Values : 1 2 1 0 1 0 1
$B
integer-Rle of length 11 with 4 runs
Lengths: 1 2 6 2
Values : 0 1 2 1
对于每个商店,这是使用情况(或覆盖范围)的 运行 长度编码表示。对于商店 A
,例如,第一个小时的使用量为 1,接下来的 2 小时内的使用量为 2,依此类推。
这可以强制转换为更易于理解的使用间隔表示:
lapply(usage, function(x) data.table(
start_dt = origin + hours(start(x) - 1L),
duration_hr = width(x),
usage = runValue(x))) %>%
rbindlist(idcol = "shop")
shop start_dt duration_hr usage
1: A 2019-03-24 00:00:00 1 1
2: A 2019-03-24 01:00:00 2 2
3: A 2019-03-24 03:00:00 2 1
4: A 2019-03-24 05:00:00 1 0
5: A 2019-03-24 06:00:00 12 1
6: A 2019-03-24 18:00:00 4 0
7: A 2019-03-24 22:00:00 8 1
8: B 2019-03-24 00:00:00 1 0
9: B 2019-03-24 01:00:00 2 1
10: B 2019-03-24 03:00:00 6 2
11: B 2019-03-24 09:00:00 2 1
或者,可以按照 OP 的要求将使用情况显示为一系列每小时间隔:
lapply(usage, function(x) data.table(
usage = decode(x))[, timestamp := origin + hours(seq_along(usage) - 1L)]) %>%
rbindlist(idcol = "shop")
shop usage timestamp
1: A 1 2019-03-24 00:00:00
2: A 2 2019-03-24 01:00:00
3: A 2 2019-03-24 02:00:00
4: A 1 2019-03-24 03:00:00
5: A 1 2019-03-24 04:00:00
6: A 0 2019-03-24 05:00:00
7: A 1 2019-03-24 06:00:00
8: A 1 2019-03-24 07:00:00
9: A 1 2019-03-24 08:00:00
10: A 1 2019-03-24 09:00:00
11: A 1 2019-03-24 10:00:00
12: A 1 2019-03-24 11:00:00
13: A 1 2019-03-24 12:00:00
14: A 1 2019-03-24 13:00:00
15: A 1 2019-03-24 14:00:00
16: A 1 2019-03-24 15:00:00
17: A 1 2019-03-24 16:00:00
18: A 1 2019-03-24 17:00:00
19: A 0 2019-03-24 18:00:00
20: A 0 2019-03-24 19:00:00
21: A 0 2019-03-24 20:00:00
22: A 0 2019-03-24 21:00:00
23: A 1 2019-03-24 22:00:00
24: A 1 2019-03-24 23:00:00
25: A 1 2019-03-25 00:00:00
26: A 1 2019-03-25 01:00:00
27: A 1 2019-03-25 02:00:00
28: A 1 2019-03-25 03:00:00
29: A 1 2019-03-25 04:00:00
30: A 1 2019-03-25 05:00:00
31: B 0 2019-03-24 00:00:00
32: B 1 2019-03-24 01:00:00
33: B 1 2019-03-24 02:00:00
34: B 2 2019-03-24 03:00:00
35: B 2 2019-03-24 04:00:00
36: B 2 2019-03-24 05:00:00
37: B 2 2019-03-24 06:00:00
38: B 2 2019-03-24 07:00:00
39: B 2 2019-03-24 08:00:00
40: B 1 2019-03-24 09:00:00
41: B 1 2019-03-24 10:00:00
shop usage timestamp
数据
library(data.table)
library(lubridate)
d <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(rep("2019-03-24", 6L)),
start_hr = c(23, 0, 2, 7, 4, 2),
duration_hr = c(8, 4, 4, 12, 6, 10)
)
我不确定我是否正确使用了术语 wide/long,但我正在尝试重新格式化轮班数据,以便我可以看到每小时增加了多少劳动力。
假设我有以下数据集:
library(data.table)
sample_DT <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(c("2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24")),
start_hr = c(23,0,2,7,4,2),
duration_hr = c(8,4,4,12,6,10)
)
看起来像:
store date start_hr duration_hr
<char> <Date> <num> <num>
1: A 2019-03-24 23 8
2: A 2019-03-24 0 4
3: A 2019-03-24 2 4
4: A 2019-03-24 7 12
5: B 2019-03-24 4 6
6: B 2019-03-24 2 10
我试图查看每个日期的每个小时间隔(0-1、1-2、2-3、3-4,...)期间,商店使用了多少劳动力。所以数据应该是这样的:
store date time_hr usage
A 2019-03-24 0 1
A 2019-03-24 1 1
A 2019-03-24 2 2
A 2019-03-24 3 2
A 2019-03-24 4 1
A 2019-03-24 5 1
B ...
B ...
在上面,time_hr
表示时间间隔(例如,time_hr = 0
表示从午夜到凌晨1点的间隔)。请注意,有时班次可以 运行 多天(例如从 23:00 开始,持续 8 小时)。
谢谢!
这可以通过几种方式完成,这是一种使用旋转数据的方法,这样每个班次开始都变成 +1
,每个班次结束都变成 -1
。
library(dplyr); library(tidyr); library(lubridate)
sample_df %>%
transmute(store,
start_hr = ymd_h(paste(date, start_hr)),
end_hr = start_hr + dhours(duration_hr)) %>%
pivot_longer(-store,
names_to = "change",
values_to = "timestamp") %>%
mutate(change = if_else(change == "start_hr", 1, -1)) %>%
count(store, timestamp, wt = change, name = "change") %>% # total change/hr
arrange(store, timestamp) %>%
complete(store,
timestamp = seq.POSIXt(min(timestamp), max(timestamp),
by = "hour"),
fill = list(change = 0)) %>%
mutate(usage = cumsum(change))
结果
# A tibble: 64 × 4
store timestamp change usage
<chr> <dttm> <dbl> <dbl>
1 A 2019-03-24 00:00:00 1 1
2 A 2019-03-24 01:00:00 0 1
3 A 2019-03-24 02:00:00 1 2
4 A 2019-03-24 03:00:00 0 2
5 A 2019-03-24 04:00:00 -1 1
6 A 2019-03-24 05:00:00 0 1
7 A 2019-03-24 06:00:00 -1 0
8 A 2019-03-24 07:00:00 1 1
9 A 2019-03-24 08:00:00 0 1
10 A 2019-03-24 09:00:00 0 1
# … with 54 more rows
这是一种使用 data.table 并拆分出一个小函数 f
的方法,该函数构成一系列 hr 间隔。不同的间隔然后由商店求和。
f <- function(d,h, dur) seq(ymd_h(paste(d,h)), by="hour",length.out=dur)
sample_DT[,id:=.I] %>%
.[, .(hr_interval = f(date,start_hr, duration_hr)), by=.(store, id)] %>%
.[, .(usage = .N), .(store, hr_interval)] %>%
.[order(store,hr_interval)]
输出:
store hr_interval usage
1: A 2019-03-24 00:00:00 1
2: A 2019-03-24 01:00:00 1
3: A 2019-03-24 02:00:00 2
4: A 2019-03-24 03:00:00 2
5: A 2019-03-24 04:00:00 1
6: A 2019-03-24 05:00:00 1
7: A 2019-03-24 07:00:00 1
8: A 2019-03-24 08:00:00 1
9: A 2019-03-24 09:00:00 1
10: A 2019-03-24 10:00:00 1
11: A 2019-03-24 11:00:00 1
12: A 2019-03-24 12:00:00 1
13: A 2019-03-24 13:00:00 1
14: A 2019-03-24 14:00:00 1
15: A 2019-03-24 15:00:00 1
16: A 2019-03-24 16:00:00 1
17: A 2019-03-24 17:00:00 1
18: A 2019-03-24 18:00:00 1
19: A 2019-03-24 23:00:00 1
20: A 2019-03-25 00:00:00 1
21: A 2019-03-25 01:00:00 1
22: A 2019-03-25 02:00:00 1
23: A 2019-03-25 03:00:00 1
24: A 2019-03-25 04:00:00 1
25: A 2019-03-25 05:00:00 1
26: A 2019-03-25 06:00:00 1
27: B 2019-03-24 02:00:00 1
28: B 2019-03-24 03:00:00 1
29: B 2019-03-24 04:00:00 2
30: B 2019-03-24 05:00:00 2
31: B 2019-03-24 06:00:00 2
32: B 2019-03-24 07:00:00 2
33: B 2019-03-24 08:00:00 2
34: B 2019-03-24 09:00:00 2
35: B 2019-03-24 10:00:00 1
36: B 2019-03-24 11:00:00 1
store hr_interval usage
为了完整起见,我想提一下 IRanges
包中的 coverage()
函数,它完成了所有繁重的工作。它在允许结果的紧凑 RLE(运行 长度编码)表示的间隔上工作,并将其转换为 OP 要求的网格显示。
唯一的缺点是它需要 整数 间隔。因此,我们必须来回强制将 date-time 间隔转换为整数间隔。
library(data.table)
library(lubridate)
library(magrittr)
if (!"IRanges" %in% rownames(installed.packages())) {
install.packages("IRanges",
repos = "https://bioconductor.org/packages/3.15/bioc")
}
library(IRanges)
origin <- min(d$date)
usage <- d[, IRanges(start = as.integer(date - origin) * 24L + start_hr,
width = duration_hr,
names = store)] %>%
split(names(.)) %>%
coverage()
usage
RleList of length 2 $A integer-Rle of length 30 with 7 runs Lengths: 1 2 2 1 12 4 8 Values : 1 2 1 0 1 0 1 $B integer-Rle of length 11 with 4 runs Lengths: 1 2 6 2 Values : 0 1 2 1
对于每个商店,这是使用情况(或覆盖范围)的 运行 长度编码表示。对于商店 A
,例如,第一个小时的使用量为 1,接下来的 2 小时内的使用量为 2,依此类推。
这可以强制转换为更易于理解的使用间隔表示:
lapply(usage, function(x) data.table(
start_dt = origin + hours(start(x) - 1L),
duration_hr = width(x),
usage = runValue(x))) %>%
rbindlist(idcol = "shop")
shop start_dt duration_hr usage 1: A 2019-03-24 00:00:00 1 1 2: A 2019-03-24 01:00:00 2 2 3: A 2019-03-24 03:00:00 2 1 4: A 2019-03-24 05:00:00 1 0 5: A 2019-03-24 06:00:00 12 1 6: A 2019-03-24 18:00:00 4 0 7: A 2019-03-24 22:00:00 8 1 8: B 2019-03-24 00:00:00 1 0 9: B 2019-03-24 01:00:00 2 1 10: B 2019-03-24 03:00:00 6 2 11: B 2019-03-24 09:00:00 2 1
或者,可以按照 OP 的要求将使用情况显示为一系列每小时间隔:
lapply(usage, function(x) data.table(
usage = decode(x))[, timestamp := origin + hours(seq_along(usage) - 1L)]) %>%
rbindlist(idcol = "shop")
shop usage timestamp 1: A 1 2019-03-24 00:00:00 2: A 2 2019-03-24 01:00:00 3: A 2 2019-03-24 02:00:00 4: A 1 2019-03-24 03:00:00 5: A 1 2019-03-24 04:00:00 6: A 0 2019-03-24 05:00:00 7: A 1 2019-03-24 06:00:00 8: A 1 2019-03-24 07:00:00 9: A 1 2019-03-24 08:00:00 10: A 1 2019-03-24 09:00:00 11: A 1 2019-03-24 10:00:00 12: A 1 2019-03-24 11:00:00 13: A 1 2019-03-24 12:00:00 14: A 1 2019-03-24 13:00:00 15: A 1 2019-03-24 14:00:00 16: A 1 2019-03-24 15:00:00 17: A 1 2019-03-24 16:00:00 18: A 1 2019-03-24 17:00:00 19: A 0 2019-03-24 18:00:00 20: A 0 2019-03-24 19:00:00 21: A 0 2019-03-24 20:00:00 22: A 0 2019-03-24 21:00:00 23: A 1 2019-03-24 22:00:00 24: A 1 2019-03-24 23:00:00 25: A 1 2019-03-25 00:00:00 26: A 1 2019-03-25 01:00:00 27: A 1 2019-03-25 02:00:00 28: A 1 2019-03-25 03:00:00 29: A 1 2019-03-25 04:00:00 30: A 1 2019-03-25 05:00:00 31: B 0 2019-03-24 00:00:00 32: B 1 2019-03-24 01:00:00 33: B 1 2019-03-24 02:00:00 34: B 2 2019-03-24 03:00:00 35: B 2 2019-03-24 04:00:00 36: B 2 2019-03-24 05:00:00 37: B 2 2019-03-24 06:00:00 38: B 2 2019-03-24 07:00:00 39: B 2 2019-03-24 08:00:00 40: B 1 2019-03-24 09:00:00 41: B 1 2019-03-24 10:00:00 shop usage timestamp
数据
library(data.table)
library(lubridate)
d <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(rep("2019-03-24", 6L)),
start_hr = c(23, 0, 2, 7, 4, 2),
duration_hr = c(8, 4, 4, 12, 6, 10)
)