Split a date range into several chunks ending on YYYY-12-31
df <- data.frame(group = c("a", "a", "b", "b"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02"))
Suppose I have the following df:
group start end
1 a 2017-05-01 2018-09-01
2 a 2019-04-03 2020-04-03
3 b 2011-03-03 2012-05-03
4 b 2014-05-07 2016-04-02
I would like to turn it into this format, where each record is split at 31/12 of its start year and of every subsequent year:
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
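Any ideas on how to tackle this?
Edit:
My main concern is not date ranges that fall within a single year. However, as chinsoon12 pointed out, it would indeed help if the approach could handle them as well, as in this dataset: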
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
The final result would keep the last row unchanged:
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
10 c 2017-02-01 2017-04-05
A possible solution with data.table:
library(data.table)
setDT(df)
df[df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
, by = .(group, rleid(start))][]
which gives:
group start end
1: a 2017-05-01 2017-12-31
2: a 2018-01-01 2018-09-01
3: a 2019-04-03 2019-12-31
4: a 2020-01-01 2020-04-03
5: b 2011-03-03 2011-12-31
6: b 2012-01-01 2012-05-03
7: b 2014-05-07 2014-12-31
8: b 2015-01-01 2015-12-31
9: b 2016-01-01 2016-04-02
10: c 2017-02-01 2017-04-05
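The key step in the solution above is the row expansion: `rep(.I, 1 + year(end) - year(start))` repeats each row index once per calendar year that the range touches. A minimal base R sketch of just that step, assuming `format(x, "%Y")` as a stand-in for data.table's `year()` and using three rows from the data above:

```r
# Three of the ranges from the example data
start <- as.Date(c("2017-05-01", "2014-05-07", "2017-02-01"))
end   <- as.Date(c("2018-09-01", "2016-04-02", "2017-04-05"))

# Number of calendar years each range touches: 2, 3 and 1
n_years <- 1 + as.integer(format(end, "%Y")) - as.integer(format(start, "%Y"))

# Row indices repeated once per touched year: 1 1 2 2 2 3
idx <- rep(seq_along(start), n_years)
```

Subsetting by `idx` yields one row per (range, year) pair; the `pmax()`/`pmin()` calls then clamp each copy to its own year.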
Two alternative data.table solutions:
# alternative 1:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = if (.N == 1) start else c(start[1], as.Date(paste0(year(start[1]) + 1:(.N-1), '-01-01') )),
end = if (.N == 1) end else c(as.Date(paste0(year(end[.N]) - (.N-1):1, '-12-31') ), end[.N]))
, by = .(group, ri)][, ri := NULL][]
# alternative 2:
df[, ri := rowid(group)
][df[, rep(.I, 1 + year(end) - year(start))]
][, `:=` (start = pmax(start[1], as.Date(paste0(year(start[1]) + 0:(.N-1), '-01-01'))),
end = pmin(end[.N], as.Date(paste0(year(end[.N]) - (.N-1):0, '-12-31'))))
, by = .(group, ri)][, ri := NULL][]
Used data:
df <- data.frame(group = c("a", "a", "b", "b", "c"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05"))
df[2:3] <- lapply(df[2:3], as.Date)
Here is a no-tidyverse/no-data.table version:
df <- data.frame(group = c("a", "a", "b", "b"),
start = c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07"),
end = c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02"), stringsAsFactors=FALSE)
# added stringsAsFactors =FALSE to your df for sanity
# reformatting start and end as Date
df$start <- as.Date(df$start)
df$end <- as.Date(df$end)
dfs <- split(df, rownames(df))
# split the data frame by rows
res <- do.call(rbind, lapply(dfs, function(.){
s <- seq(from=.$start, to=.$end, by="day")
# sequence from df$start to df$end, by days
y <- format(s, "%Y")
# years of that sequence
s2 <- as.character(s)
# formatting s as character -- otherwise sapply will get rid of the
# Date class and the result will look as numeric
ys <- split(s2,y)
# split the sequence by years
data.frame(group=.$group, start=sapply(ys, head,1), end = sapply(ys, tail, 1), stringsAsFactors=FALSE)
# take the first and last element from each "sub-vector" of the split sequence
}))
rownames(res) <- NULL # kill the nasty rownames
res
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
Note that the start and end columns of the result are character, as in the original data frame.
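If Date columns are preferred in the result, one possible base R variant builds the year boundaries directly instead of enumerating every day. This is only a sketch: the helper name `split_by_year` is made up for illustration, and it assumes `start` and `end` are already of class Date.

```r
# Hypothetical helper (not part of the answer above): split each row at year
# boundaries while keeping start/end as Date.
split_by_year <- function(df) {
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    # calendar years touched by row i
    yrs <- as.integer(format(df$start[i], "%Y")):as.integer(format(df$end[i], "%Y"))
    data.frame(
      group = df$group[i],
      # clamp every year's 1 January / 31 December to the original range
      start = pmax(df$start[i], as.Date(paste0(yrs, "-01-01"))),
      end   = pmin(df$end[i],   as.Date(paste0(yrs, "-12-31"))),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

df <- data.frame(group = c("a", "a", "b", "b", "c"),
                 start = as.Date(c("2017-05-01", "2019-04-03", "2011-03-03", "2014-05-07", "2017-02-01")),
                 end   = as.Date(c("2018-09-01", "2020-04-03", "2012-05-03", "2016-04-02", "2017-04-05")),
                 stringsAsFactors = FALSE)
res <- split_by_year(df)
```

With the five-row dataset from the edited question this should give the same ten rows as the expected output, with `start` and `end` still of class Date.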
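I am sorry about the way base R handles date (and POSIXct) objects: you never know when they will lose their class and turn into plain numbers. Here I sidestep this "feature" by treating the dates as character, except where date operations are needed, such as when creating the date sequence.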
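The class loss mentioned here is easy to reproduce. In this small sketch (the variable names are only for illustration), `sapply()` simplifies its result to a bare numeric vector, while combining the pieces with `c()` keeps the Date class:

```r
dates <- list(as.Date("2017-05-01"), as.Date("2018-01-01"))

# sapply() simplifies via unlist(), which strips the Date class:
# the result is a plain numeric vector of day counts since 1970-01-01
dropped <- sapply(dates, head, 1)

# combining the pieces with c() dispatches to c.Date and keeps the class
kept <- do.call(c, lapply(dates, head, 1))
```

This is why the answer converts to character before calling `sapply()`.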
library(tidyverse)
library(lubridate)
df%>%
mutate(end=as.Date(end),
start=as.Date(start),
diff=Map(":",0,year(end)-year(start)))%>%
unnest(diff)%>%
mutate(end=pmin(end,as.Date(paste0(year(start)+diff,"-12-31"))),
start=pmax(start,as.Date(paste0(year(start)+diff,"-1-1"))),
diff=NULL)
# A tibble: 9 x 3
group start end
<fct> <date> <date>
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
Running this exact code on the updated data gives:
group start end
1 a 2017-05-01 2017-12-31
2 a 2018-01-01 2018-09-01
3 a 2019-04-03 2019-12-31
4 a 2020-01-01 2020-04-03
5 b 2011-03-03 2011-12-31
6 b 2012-01-01 2012-05-03
7 b 2014-05-07 2014-12-31
8 b 2015-01-01 2015-12-31
9 b 2016-01-01 2016-04-02
10 c 2017-02-01 2017-04-05
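Whichever approach is used, the result can be sanity-checked with a few lines of base R. The following is a sketch under the assumption that `start` and `end` are Date columns; `check_chunks` is a made-up name, and the small `chunks` data frame copies three rows from the output above.

```r
# Hypothetical checker: every chunk must stay inside one calendar year
# and must not end before it starts.
check_chunks <- function(res) {
  same_year <- format(res$start, "%Y") == format(res$end, "%Y")
  ordered   <- res$start <= res$end
  all(same_year) && all(ordered)
}

chunks <- data.frame(group = c("b", "b", "b"),
                     start = as.Date(c("2014-05-07", "2015-01-01", "2016-01-01")),
                     end   = as.Date(c("2014-12-31", "2015-12-31", "2016-04-02")))
ok <- check_chunks(chunks)

# The unsplit original range spans three years, so it fails the check
unsplit <- data.frame(group = "b",
                      start = as.Date("2014-05-07"),
                      end   = as.Date("2016-04-02"))
bad <- check_chunks(unsplit)
```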