Updating Dates and Date Intervals in R
I'm not even sure I described the problem accurately in the title, but here goes.
Suppose I have the following data.table/data.frame:
library(data.table)
library(lubridate)
DT <- data.table(begin = c("2019-06-01 09:00:00","2019-06-01 09:00:00", "2019-06-01 09:00:00",
"2019-06-01 09:00:00", "2016-06-01 09:00:00","2016-06-01 09:00:00"),
end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
"2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
person = c("A", "A","A", "B", "B", "C"))
begin end person
1: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
2: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
3: 2019-06-01 09:00:00 2019-06-03 14:00:00 A
4: 2019-06-01 09:00:00 2019-06-02 05:00:00 B
5: 2016-06-01 09:00:00 2019-06-02 05:00:00 B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00 C
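This is essentially a dataset of timestamps recording when each person's period (shift) begins and ends. Each person's row is repeated once for every day the period spans. For example, person A has three entries for the same "shift" because that shift spans three different dates: 06-01, 06-02, and 06-03. Shifts that begin and end on the same day appear only once.
What I want is to update the begin and end timestamps in the dataset above so that I can see when each shift begins and ends at the day level. So the dataset should look like this: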
begin end person
1: 2019-06-01 09:00:00 2019-06-02 00:00:00 A
2: 2019-06-02 00:00:00 2019-06-03 00:00:00 A
3: 2019-06-03 00:00:00 2019-06-03 14:00:00 A
4: 2019-06-01 09:00:00 2019-06-02 00:00:00 B
5: 2016-06-02 00:00:00 2019-06-02 05:00:00 B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00 C
Any help would be greatly appreciated!
Assuming row 5 for person B contains a typo (the shift begins in 2019, not 2016):
library(data.table)
library(lubridate)
> DT <- data.table(begin = c("2019-06-01 09:00:00","2019-06-01 09:00:00", "2019-06-01 09:00:00",
+ "2019-06-01 09:00:00", "2019-06-01 09:00:00","2016-06-01 09:00:00"),
+ end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
+ "2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
+ person = c("A", "A","A", "B", "B", "C"))
>
> DT[, `:=`(min=as.numeric(difftime(end,begin, units="mins")),
+ days=as.numeric(as_date(end)-as_date(begin)+1))][, min_day:=min/days]
>
> unique(DT)
begin end person min days min_day
1: 2019-06-01 09:00:00 2019-06-03 14:00:00 A 3180 3 1060
2: 2019-06-01 09:00:00 2019-06-02 05:00:00 B 1200 2 600
3: 2016-06-01 09:00:00 2016-06-01 23:15:00 C 855 1 855
First, fix the dates (I've corrected row 5, which began in 2016 and ended in 2019; that seems unlikely):
DT[, c("begin", "end"):=lapply(.SD, as.POSIXct), .SDcols=c("begin", "end")]
## we get this
DT <- as.data.table(structure(list(begin = structure(c(1559394000, 1559394000, 1559394000, 1559394000, 1559394000, 1464786000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1559584800, 1559584800, 1559584800, 1559466000, 1559466000, 1464837300), class = c("POSIXct", "POSIXt"), tzone = ""), person = c("A", "A", "A", "B", "B", "C")), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
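Second, we create this function:
func <- function(st, en) {
  # midnight that closes each day touched by the interval
  midns <- lubridate::ceiling_date(seq(st, en, by = "day"), unit = "day")
  # keep only midnights strictly inside (st, en), then add both endpoints
  times <- unique(sort(c(midns[ st < midns & midns < en], st, en)))
  # consecutive pairs of time points become the per-day pieces
  data.table(begin = times[-length(times)], end = times[-1])
}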
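To make the midnight-splitting idea easier to follow in isolation, here is a minimal base-R sketch of the same technique for a single interval. The name `split_at_midnight` is hypothetical and not part of the answer above, and the example assumes UTC timestamps so that the result does not depend on the local time zone.

```r
# Hypothetical helper (base R only): split one interval at each midnight it crosses.
split_at_midnight <- function(st, en) {
  # midnight at the start of the day containing `st`
  day_start <- as.POSIXct(trunc(st, units = "days"))
  # candidate midnights, stepping by calendar day ("DSTday" keeps wall-clock midnight)
  mids <- seq(day_start, en, by = "DSTday")
  # keep only midnights strictly inside the interval
  mids <- mids[mids > st & mids < en]
  times <- c(st, mids, en)
  data.frame(begin = times[-length(times)], end = times[-1])
}

# Person A's shift from the question (time zone is an assumption)
st <- as.POSIXct("2019-06-01 09:00:00", tz = "UTC")
en <- as.POSIXct("2019-06-03 14:00:00", tz = "UTC")
pieces <- split_at_midnight(st, en)
pieces
```

A shift that begins and ends on the same day has no midnight inside it, so it comes back as a single unchanged row.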
Finally, we use it, with by=.(person) to keep that column in the output. I'm using unique(DT) because we don't need (or even want) the duplicated rows for each shift/day:
unique(DT)[, rbindlist(Map(func, begin, end)), by = .(person)]
# person begin end
# <char> <POSc> <POSc>
# 1: A 2019-06-01 09:00:00 2019-06-02 00:00:00
# 2: A 2019-06-02 00:00:00 2019-06-03 00:00:00
# 3: A 2019-06-03 00:00:00 2019-06-03 14:00:00
# 4: B 2019-06-01 09:00:00 2019-06-02 00:00:00
# 5: B 2019-06-02 00:00:00 2019-06-02 05:00:00
# 6: C 2016-06-01 09:00:00 2016-06-01 23:15:00
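Once the shifts are split by day, the minutes worked on each calendar day follow directly from the difference between `end` and `begin`. The sketch below is only an illustration: it rebuilds the six result rows by hand in base R, assumes UTC, and uses the made-up names `res` and `totals`. It is not part of either answer.

```r
# Rebuild the split result by hand (UTC is an assumption for reproducibility)
res <- data.frame(
  person = c("A", "A", "A", "B", "B", "C"),
  begin = as.POSIXct(c("2019-06-01 09:00:00", "2019-06-02 00:00:00",
                       "2019-06-03 00:00:00", "2019-06-01 09:00:00",
                       "2019-06-02 00:00:00", "2016-06-01 09:00:00"), tz = "UTC"),
  end = as.POSIXct(c("2019-06-02 00:00:00", "2019-06-03 00:00:00",
                     "2019-06-03 14:00:00", "2019-06-02 00:00:00",
                     "2019-06-02 05:00:00", "2016-06-01 23:15:00"), tz = "UTC")
)

# minutes covered by each per-day piece
res$minutes <- as.numeric(difftime(res$end, res$begin, units = "mins"))

# total minutes per person, for comparison with the per-shift totals
totals <- tapply(res$minutes, res$person, sum)
```

The per-person totals agree with the `min` column computed in the first answer (3180, 1200, and 855), while the per-row values give the actual minutes on each day rather than an even average.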