dplyr: group events in a session together
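I have a data frame like the one shown below. I want to group together the events of each "unique" session. For example, in the case below, ID 1 interacted with my system twice, in two separate sessions. I want to spread() (tidyr) the data, but per session rather than per ID. How can I do this with dplyr and tidyr?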
> df
  id event                time
1  1 start 2015-05-16 22:46:53
2  1 valid 2015-05-16 22:46:56
3  1   end 2015-05-16 22:46:59
4  2 start 2015-05-16 22:46:53
5  2   bad 2015-05-16 22:47:00
6  1 start 2015-05-16 22:49:05
7  1   bad 2015-05-16 22:49:09
The desired output is something like this:
> df1
  nid           starttime           validtime             badtime             endtime
1   1 2015-05-16 22:46:53 2015-05-16 22:46:56                <NA> 2015-05-16 22:46:59
2   2 2015-05-16 22:46:53                <NA> 2015-05-16 22:47:00                <NA>
3   1 2015-05-16 22:49:05                <NA> 2015-05-16 22:49:09                <NA>
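Here is one approach. I was not sure whether your time column is a date-time object or a character vector, so in mydf I created time as a date-time (POSIXct) object. When I reshaped the data, I noticed that spread() converted the date-time values to numbers, so I decided to convert time to character first. I then created a new variable called group, which marks each consecutive run of the same id and lets spread() reshape the data per session rather than per id. To keep the order you want, I used arrange(). I renamed the columns with select(). Finally, I converted the time columns back to date-time objects.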
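The group variable carries most of the logic, so it may help to look at it in isolation. The sketch below uses only base R and the id values from the example; the intermediate name changed is illustrative and does not appear in the original answer.

```r
# id column from the example data
id <- c(1, 1, 1, 2, 2, 1, 1)

# TRUE on the first row and wherever id differs from the previous row
changed <- c(TRUE, diff(id) != 0)

# A running count of those changes acts as a session counter
group <- cumsum(changed)
print(group)
# [1] 1 1 1 2 2 3 3
```

This relies on the rows of each session being contiguous and on consecutive sessions belonging to different ids. If the same id could open two sessions back to back, counting "start" events within each id (for instance cumsum(event == "start") after group_by(id)) would be one alternative, assuming every session begins with a start event.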
library(dplyr)
library(tidyr)

# Example data; time is stored as POSIXct
mydf <- data.frame(id = c(1,1,1,2,2,1,1),
                   event = c("start", "valid", "end", "start", "bad", "start", "bad"),
                   time = as.POSIXct(c("2015-05-16 22:46:53", "2015-05-16 22:46:56", "2015-05-16 22:46:59",
                                       "2015-05-16 22:46:53", "2015-05-16 22:47:00", "2015-05-16 22:49:05",
                                       "2015-05-16 22:49:09"), format = "%Y-%m-%d %H:%M:%S"),
                   stringsAsFactors = FALSE)

mutate(mydf,
       time = as.character(time),                    # keep timestamps as text while reshaping
       group = cumsum(c(TRUE, diff(id) != 0))) %>%   # new group whenever id changes
    spread(event, time) %>%                          # one column per event, one row per id/group
    arrange(group) %>%                               # restore the session order
    select(id, starttime = start, validtime = valid, badtime = bad, endtime = end) %>%
    # The answer originally used mutate_each(funs(...)), which is deprecated in current dplyr;
    # across() is the present-day equivalent.
    mutate(across(starttime:endtime, ~ as.POSIXct(.x, format = "%Y-%m-%d %H:%M:%S")))

#  id           starttime           validtime             badtime             endtime
#1  1 2015-05-16 22:46:53 2015-05-16 22:46:56                <NA> 2015-05-16 22:46:59
#2  2 2015-05-16 22:46:53                <NA> 2015-05-16 22:47:00                <NA>
#3  1 2015-05-16 22:49:05                <NA> 2015-05-16 22:49:09                <NA>
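With a more recent tidyr, the round trip through character may not be needed. The following is a minimal sketch, not part of the original answer, assuming tidyr 1.0.0 or later, where pivot_wider() keeps the POSIXct class of the value column. The tz = "UTC" argument and the result name wide are assumptions made here for reproducibility.

```r
library(dplyr)
library(tidyr)

# Same example data as above; tz = "UTC" is an assumption for reproducibility
mydf <- data.frame(
  id = c(1, 1, 1, 2, 2, 1, 1),
  event = c("start", "valid", "end", "start", "bad", "start", "bad"),
  time = as.POSIXct(c("2015-05-16 22:46:53", "2015-05-16 22:46:56",
                      "2015-05-16 22:46:59", "2015-05-16 22:46:53",
                      "2015-05-16 22:47:00", "2015-05-16 22:49:05",
                      "2015-05-16 22:49:09"), tz = "UTC"),
  stringsAsFactors = FALSE
)

wide <- mydf %>%
  # Session counter: increases whenever id changes between consecutive rows
  mutate(group = cumsum(c(TRUE, diff(id) != 0))) %>%
  # id and group identify a row; each event becomes its own column
  pivot_wider(names_from = event, values_from = time) %>%
  arrange(group) %>%
  select(id, starttime = start, validtime = valid, badtime = bad, endtime = end)

print(wide)
```

The result is a tibble with one row per session, and the four time columns stay POSIXct throughout, so no conversion back from character is required.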
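Here is an option using data.table. Using rleid and dcast from the development version of data.table, i.e. v1.9.5 (installation instructions are here; both functions are also in the CRAN releases from v1.9.6 onward), we can convert the 'long' format to 'wide' format.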
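library(data.table) # v1.9.5+

# setDT() converts df to a data.table by reference; gr numbers each consecutive run of id.
# dcast() then spreads the events into columns (alphabetical order: bad, end, start, valid),
# and the final index puts them in the order start, valid, bad, end.
dcast(setDT(df)[, gr := rleid(id)], id + gr ~ paste0(event, 'time'),
      value.var = 'time')[order(starttime)][, c(1, 5:6, 3:4), with = FALSE]
#   id           starttime           validtime             badtime             endtime
#1:  1 2015-05-16 22:46:53 2015-05-16 22:46:56                <NA> 2015-05-16 22:46:59
#2:  2 2015-05-16 22:46:53                <NA> 2015-05-16 22:47:00                <NA>
#3:  1 2015-05-16 22:49:05                <NA> 2015-05-16 22:49:09                <NA>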
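To see what rleid() contributes here, it can be run on the id column alone. This is a small illustrative sketch, assuming a data.table version that exports rleid (v1.9.6 or later from CRAN); the variable name gr mirrors the column created in the answer.

```r
library(data.table)

# id column from the example data
id <- c(1L, 1L, 1L, 2L, 2L, 1L, 1L)

# rleid() assigns one integer per run of identical consecutive values
gr <- rleid(id)
print(gr)
# [1] 1 1 1 2 2 3 3
```

The two visits by id 1 receive different run ids (1 and 3), which is why casting on id + gr yields one row per session instead of one row per id.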
Data
# tzone must be a time zone name, not a format string; "UTC" reproduces the times shown above
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 1L, 1L),
                     event = c("start", "valid", "end", "start", "bad", "start", "bad"),
                     time = structure(c(1431816413, 1431816416, 1431816419, 1431816413,
                                        1431816420, 1431816545, 1431816549),
                                      class = c("POSIXct", "POSIXt"), tzone = "UTC")),
                .Names = c("id", "event", "time"),
                row.names = c("1", "2", "3", "4", "5", "6", "7"),
                class = "data.frame")