Identify Runs of Consecutive Observations by Group and Reshape
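I am trying to identify runs of consecutive observations, group them, and reshape the data so that the start and the end of each run each occupy a column.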
## REPRODUCIBLE EXAMPLE
> dput(example)
structure(list(id = c(123, 123, 123, 123, 123, 123, 123, 123,
234, 234, 234), date = structure(c(1398816000, 1398902400, 1398988800,
1399075200, 1399161600, 1350777600, 1350864000, 1350950400, 1470009600,
1470096000, 1470182400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
event = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L,
1L), .Label = c("0", "1"), class = "factor")), row.names = c(NA,
-11L), .Names = c("id", "date", "event"), class = c("tbl_df",
"tbl", "data.frame"))
## GLIMPSE DATA
> dplyr::glimpse(example)
Observations: 11
Variables: 3
$ id <dbl> 123, 123, 123, 123, 123, 123, 123, 123, 234, 234, 234
$ date <dttm> 2014-04-30, 2014-05-01, 2014-05-02, 2014-05-03, 2014-05-04, 2012-10-21, 2012-10-22, 2012-10-23, 2016-08-01, 2016-08-02, 2016-08-03
$ event <fctr> 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0
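I have broken the approach down as follows:
- Group the data by id
- Use rle to identify runs of consecutive observations within each id (e.g. rle(example$event > 0))
- Reshape from long to wide, so that min(date) and max(date) (within each run) become columns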
I'm not sure how to proceed. The data.table solution to a similar question comes close, but I haven't been able to repurpose it.
Borrowing the rleid idea from data.table:
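library(dplyr)

# example is the data from the question
example %>%
  mutate(eventGroup = data.table::rleid(event)) %>%  # label each run of identical event values
  filter(event == 1) %>%                             # keep only the runs where the event occurred
  group_by(id, eventGroup) %>%
  summarise(start = min(date),
            end = max(date))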
# id eventGroup start end
# 1 123 2 2014-05-01 2014-05-03
# 2 123 4 2012-10-22 2012-10-22
# 3 234 6 2016-08-02 2016-08-02
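To make the eventGroup values above easier to follow, here is a minimal base R sketch of how run IDs can be derived from the event column. It is only an illustration of the idea, not how data.table implements rleid; the vector names event and run_id are made up for this example.

```r
# event column from the example, written out as a plain character vector
event <- c("0", "1", "1", "1", "0", "0", "1", "0", "0", "1", "0")

# a new run starts wherever a value differs from the one before it;
# the cumulative sum of those "change" flags numbers the runs
run_id <- cumsum(c(TRUE, event[-1] != event[-length(event)]))
run_id
# 1 2 2 2 3 3 4 5 5 6 7
```

The runs where event is "1" get IDs 2, 4 and 6, which matches the eventGroup column in the output. Run 5 spans the last row of id 123 and the first row of id 234, because the IDs are computed over the whole column; grouping by both id and eventGroup afterwards keeps the two ids apart.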
Another option:
library(data.table)
# ex holds the example data; setDT() converts it to a data.table by reference
setDT(ex)[, rl := rleid(event), by = id][event == "1", .(start = min(date), stop = max(date)), by = "id,rl"][, rl := NULL][]
# id start stop
# 1: 123 2014-05-01 2014-05-03
# 2: 123 2012-10-22 2012-10-22
# 3: 234 2016-08-02 2016-08-02
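For comparison, the three steps listed in the question (group by id, rle within id, one start and one end column per run) can also be sketched in base R alone. This is a rough sketch under assumptions: find_runs is a hypothetical helper name, the data is rebuilt as a plain data.frame, and the first and last rows of a run are taken as its start and end, which assumes the dates are in ascending order within each run (true for the example).

```r
# Rebuild the example data from the question as a plain data.frame
example <- data.frame(
  id = c(rep(123, 8), rep(234, 3)),
  date = as.POSIXct(c(1398816000, 1398902400, 1398988800, 1399075200,
                      1399161600, 1350777600, 1350864000, 1350950400,
                      1470009600, 1470096000, 1470182400),
                    origin = "1970-01-01", tz = "UTC"),
  event = factor(c(0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0))
)

# find_runs (hypothetical helper): one row per run of event == "1" within a single id
find_runs <- function(d) {
  r <- rle(as.character(d$event))
  last <- cumsum(r$lengths)        # row index of the last observation in each run
  first <- last - r$lengths + 1    # row index of the first observation in each run
  keep <- r$values == "1"          # only runs where the event occurred
  data.frame(id = d$id[first[keep]],
             start = d$date[first[keep]],
             end = d$date[last[keep]])
}

# apply the helper to each id separately, then stack the results
runs <- do.call(rbind, lapply(split(example, example$id), find_runs))
rownames(runs) <- NULL
runs
```

Because rle is applied to each id separately, a run can never cross from one id into the next, so no extra grouping column is needed here.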