Group and subset time

I have time data in a data frame, as shown below:

          date day       time      phone      lat      lon acc       update
6   12/08/2014 Tue 07:25:35PM 9052780809 17.41653 78.40537 3.9 1.406988e+12
44  12/08/2014 Tue 07:26:35PM 9052780809 17.41823 78.40344 3.9 1.406988e+12
114 12/08/2014 Tue 07:28:32PM 9052780809 17.41810 78.39846 3.9 1.406988e+12
152 12/08/2014 Tue 07:29:30PM 9052780809 17.41760 78.39512 3.9 1.406988e+12
188 12/08/2014 Tue 07:30:31PM 9052780809 17.41517 78.39426 3.9 1.406988e+12
223 12/08/2014 Tue 07:31:30PM 9052780809 17.41467 78.39434 3.9 1.406988e+12

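Most of the readings are 1-2 minutes apart, but in some cases the difference is more than 10 minutes, for example after the second reading. When the difference exceeds 10 minutes, the consecutive readings may even fall on different dates. I want to insert a break after every reading that is followed by a gap of more than 10 minutes, and put the resulting pieces into another data frame for further processing.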

             date day       time      phone      lat      lon acc       update
145315 16/08/2014 Sat 11:54:57AM 9052780809 17.41377 78.45923 3.9 1.406988e+12
145371 16/08/2014 Sat 11:55:56AM 9052780809 17.41626 78.45750 3.9 1.406988e+12
145426 16/08/2014 Sat 11:56:55AM 9052780809 17.41746 78.45547 4.0 1.406988e+12
162349 16/08/2014 Sat 05:02:51PM 9052780809 17.41562 78.44446 3.9 1.406988e+12
162404 16/08/2014 Sat 05:03:55PM 9052780809 17.41577 78.44113 3.9 1.406988e+12
162452 16/08/2014 Sat 05:04:51PM 9052780809 17.41638 78.43815 3.9 1.406988e+12

The original data has 8 columns and more than 700,000 rows.

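Just pasting this from the comments so that the question gets answered. You can use split (suggested by @docendo discimus) and difftime (from @Laurik) to get the expected dataset.

Assuming "time1" is the "time" column in the dataset ("dat"), convert "time1" to "POSIXlt" class with strptime, then use difftime to get the difference in minutes between consecutive elements. Here, I dropped the last element and the first element so that we can take the difference between the current element dt1[-length(dt1)] and the next element dt1[-1]. Apply the condition > 10, take the cumsum of the resulting logical index, and split the dataset on that index to get a list of data.frames (lst). Working with a list may be better than creating individual data.frame objects.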

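# Parse the time strings (12-hour clock with AM/PM) into POSIXlt
dt1 <- strptime(dat$time1, format='%I:%M:%OS%p')
# abs() is needed: difftime(current, next) is negative for increasing times
lst <- split(dat, cumsum(c(FALSE, abs(difftime(dt1[-length(dt1)],
                            dt1[-1], units='mins')) > 10)))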
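To see why the cumsum step yields group labels, here is a minimal sketch on a made-up vector of timestamps. The values and the names times, gap, new_group and grp are illustrative only and do not come from the question.

```r
# Hypothetical timestamps: two readings a minute apart, a long gap, then two more
times <- as.POSIXct(c("2014-08-16 11:54:57", "2014-08-16 11:55:56",
                      "2014-08-16 17:02:51", "2014-08-16 17:03:55"),
                    tz = "UTC")

# Gap in minutes between each reading and the one before it
gap <- abs(as.numeric(difftime(times[-1], times[-length(times)],
                               units = "mins")))

# TRUE marks the first reading after a gap of more than 10 minutes;
# the leading FALSE keeps the very first reading in group 0
new_group <- c(FALSE, gap > 10)

# Every TRUE bumps the running total by one, giving one group id per reading
grp <- cumsum(new_group)
grp
# 0 0 1 1
```

Passing grp as the second argument of split then cuts the data at exactly the rows where the id changes.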

Update

Using the new dataset dat:

 dt1 <- with(dat, strptime(paste(date, time),
                     format='%d/%m/%Y %I:%M:%OS%p'))

 indx <- cumsum(c(FALSE, abs(difftime(dt1[-length(dt1)], dt1[-1],
       units='mins')) > 10))
 split(dat, indx)
 #$`0`
 #        date day       time      phone      lat      lon acc       update
 #6   12/08/2014 Tue 07:25:35PM 9052780809 17.41653 78.40537 3.9 1.406988e+12
 #44  12/08/2014 Tue 07:26:35PM 9052780809 17.41823 78.40344 3.9 1.406988e+12
 #114 12/08/2014 Tue 07:28:32PM 9052780809 17.41810 78.39846 3.9 1.406988e+12
 #152 12/08/2014 Tue 07:29:30PM 9052780809 17.41760 78.39512 3.9 1.406988e+12
 #188 12/08/2014 Tue 07:30:31PM 9052780809 17.41517 78.39426 3.9 1.406988e+12
 #223 12/08/2014 Tue 07:31:30PM 9052780809 17.41467 78.39434 3.9 1.406988e+12

 #$`1`
 #           date day       time      phone      lat      lon acc       update
 #145315 16/08/2014 Sat 11:54:57AM 9052780809 17.41377 78.45923 3.9 1.406988e+12
 #145371 16/08/2014 Sat 11:55:56AM 9052780809 17.41626 78.45750 3.9 1.406988e+12
 #145426 16/08/2014 Sat 11:56:55AM 9052780809 17.41746 78.45547 4.0 1.406988e+12

#$`2`
#            date day       time      phone      lat      lon acc       update
#162349 16/08/2014 Sat 05:02:51PM 9052780809 17.41562 78.44446 3.9 1.406988e+12
#162404 16/08/2014 Sat 05:03:55PM 9052780809 17.41577 78.44113 3.9 1.406988e+12
#162452 16/08/2014 Sat 05:04:51PM 9052780809 17.41638 78.43815 3.9 1.406988e+12
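Once the data is split, each piece can be processed with lapply. The following is a rough sketch on a small made-up data frame, not on the real data: toy, stamp, grp, summarise_trip and trips are hypothetical names, and the example assumes a locale in which strptime recognises AM/PM for %p. It keeps the group id as a column and reports the number of readings and the first and last timestamp per group.

```r
# Small made-up data frame in the same date/time layout as the question
toy <- data.frame(
  date = c("12/08/2014", "12/08/2014", "16/08/2014", "16/08/2014", "16/08/2014"),
  time = c("07:25:35PM", "07:26:35PM", "11:54:57AM", "11:55:56AM", "05:02:51PM"),
  stringsAsFactors = FALSE
)

# Parse date and time together so gaps across days are measured correctly
toy$stamp <- as.POSIXct(strptime(paste(toy$date, toy$time),
                                 format = "%d/%m/%Y %I:%M:%S%p", tz = "UTC"))

# Group id: increases by one after every gap longer than 10 minutes
gap <- abs(as.numeric(difftime(toy$stamp[-1], toy$stamp[-nrow(toy)],
                               units = "mins")))
toy$grp <- cumsum(c(FALSE, gap > 10))

# One summary row per group (hypothetical helper)
summarise_trip <- function(d) {
  data.frame(grp = d$grp[1], n = nrow(d),
             start = min(d$stamp), end = max(d$stamp))
}
trips <- do.call(rbind, lapply(split(toy, toy$grp), summarise_trip))
trips
```

Storing the id in a column like this may also be convenient for a large data frame, since the grouping can then be reused by aggregate, tapply or similar functions without keeping a separate list around.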

Data

dat <-     structure(list(date = c("12/08/2014", "12/08/2014", "12/08/2014", 
 "12/08/2014", "12/08/2014", "12/08/2014", "16/08/2014", "16/08/2014", 
 "16/08/2014", "16/08/2014", "16/08/2014", "16/08/2014"), day = c("Tue", 
 "Tue", "Tue", "Tue", "Tue", "Tue", "Sat", "Sat", "Sat", "Sat", 
 "Sat", "Sat"), time = c("07:25:35PM", "07:26:35PM", "07:28:32PM", 
 "07:29:30PM", "07:30:31PM", "07:31:30PM", "11:54:57AM", "11:55:56AM", 
 "11:56:55AM", "05:02:51PM", "05:03:55PM", "05:04:51PM"), phone = c(9052780809, 
 9052780809, 9052780809, 9052780809, 9052780809, 9052780809, 9052780809, 
 9052780809, 9052780809, 9052780809, 9052780809, 9052780809), 
 lat = c(17.41653, 17.41823, 17.4181, 17.4176, 17.41517, 17.41467, 
 17.41377, 17.41626, 17.41746, 17.41562, 17.41577, 17.41638
 ), lon = c(78.40537, 78.40344, 78.39846, 78.39512, 78.39426, 
 78.39434, 78.45923, 78.4575, 78.45547, 78.44446, 78.44113, 
 78.43815), acc = c(3.9, 3.9, 3.9, 3.9, 3.9, 3.9, 3.9, 3.9, 
 4, 3.9, 3.9, 3.9), update = c(1.406988e+12, 1.406988e+12, 
 1.406988e+12, 1.406988e+12, 1.406988e+12, 1.406988e+12, 1.406988e+12, 
 1.406988e+12, 1.406988e+12, 1.406988e+12, 1.406988e+12, 1.406988e+12
 )), .Names = c("date", "day", "time", "phone", "lat", "lon", 
 "acc", "update"), class = "data.frame", row.names = c("6", "44", 
 "114", "152", "188", "223", "145315", "145371", "145426", "162349", 
 "162404", "162452"))