R: aggregate second data to minutes more efficiently

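I have a data.table, allData, containing data at roughly one-second intervals (POSIXct) from different nights. Some nights fall on the same date, however, because the data were collected from different people, so I have a column nightNo that serves as an id for each distinct night.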

          timestamp  nightNo    data1     data2
2018-10-19 19:15:00        1        1         7
2018-10-19 19:15:01        1        2         8
2018-10-19 19:15:02        1        3         9
2018-10-19 18:10:22        2        4        10
2018-10-19 18:10:23        2        5        11 
2018-10-19 18:10:24        2        6        12

I would like to aggregate the data to minutes (per night), and using this question I came up with the following code:

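library(data.table)
library(dplyr)

# Average data1 and data2 within each one-minute bin of `timestamp`
aggregate_minute <- function(df){
  df %>%
    group_by(timestamp = cut(timestamp, breaks = "1 min")) %>%
    summarise(data1 = mean(data1), data2 = mean(data2)) %>%
    as.data.table()
}

# Apply per night. Note: this passes the full allData to every group;
# .SD (the current night's rows) is probably what was intended.
allData <- allData[, aggregate_minute(allData), by = nightNo]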

However, my data.table is quite large and this code is not fast enough. Is there a more efficient way to solve this?

allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)), 
                     nightNo = rep(1:2, c(3, 3)),
                     data1 = 1:6,
                     data2  = 7:12)
             timestamp nightNo data1 data2
1: 2018-06-14 10:43:11       1     1     7
2: 2018-06-14 10:43:11       1     2     8
3: 2018-06-14 10:43:11       1     3     9
4: 2018-06-14 10:48:31       2     4    10
5: 2018-06-14 10:48:31       2     5    11
6: 2018-06-14 10:48:31       2     6    12


allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]
   nightNo           timestamp data1 data2
1:       1 2018-06-14 10:43:00     2     8
2:       2 2018-06-14 10:48:00     5    11
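
With more than two data columns, writing out each mean by hand gets tedious. The same grouping can be sketched with .SD and .SDcols; this is only an illustration, and valueCols, perMinute and the fixed sample timestamps are names and values made up for the example:

```r
library(data.table)

# Illustrative data with fixed timestamps (values invented for this sketch)
allData <- data.table(
  timestamp = as.POSIXct("2018-10-19 19:15:00", tz = "UTC") + c(0, 1, 2, 320, 321, 322),
  nightNo   = rep(1:2, each = 3),
  data1     = 1:6,
  data2     = 7:12
)

# Hypothetical vector naming the columns to average
valueCols <- c("data1", "data2")

# lapply(.SD, mean) averages every column listed in .SDcols,
# once per (nightNo, one-minute bin) group
perMinute <- allData[, lapply(.SD, mean),
                     by = .(nightNo, timestamp = cut(timestamp, breaks = "1 min")),
                     .SDcols = valueCols]
```

Adding a column to valueCols is then enough to include it in the aggregation.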

> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
    user  system elapsed 
    3.25    0.02    3.31 

> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
     user  system elapsed 
     1.02    0.04    1.06 
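
cut() on a POSIXct column returns a factor of minute labels. If the grouped column should stay POSIXct, one alternative is to subtract the seconds past the minute before grouping. The following is a minimal sketch that has not been benchmarked here; floorMinute is a hypothetical helper, the sample values are made up, and it assumes the time zone's UTC offset is a whole number of minutes:

```r
library(data.table)

# Illustrative data with fixed timestamps (values invented for this sketch)
allData <- data.table(
  timestamp = as.POSIXct("2018-10-19 19:15:00", tz = "UTC") + c(0, 1, 2, 320, 321, 322),
  nightNo   = rep(1:2, each = 3),
  data1     = 1:6,
  data2     = 7:12
)

# Hypothetical helper: POSIXct stores seconds since the epoch, so removing
# the remainder modulo 60 floors each value to the start of its minute
floorMinute <- function(x) x - as.numeric(x) %% 60

perMinute <- allData[, .(data1 = mean(data1), data2 = mean(data2)),
                     by = .(nightNo, timestamp = floorMinute(timestamp))]
```

Whether this is faster than cut() on a large table is worth checking with system.time() on your own data.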

You can use lubridate to 'round' the dates and then use data.table to aggregate the columns.

library(data.table)  
library(lubridate)

Reproducible data:

text <- "timestamp  nightNo    data1     data2
'2018-10-19 19:15:00'        1        1         7
'2018-10-19 19:15:01'        1        2         8
'2018-10-19 19:15:02'        1        3         9
'2018-10-19 18:10:22'        2        4        10
'2018-10-19 18:10:23'        2        5        11 
'2018-10-19 18:10:24'        2        6        12"


allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)

Create the data.table:

setDT(allData)

Create the timestamp and floor it to the nearest minute:

allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]

Change the type of the integer columns to numeric:

allData[, ':='(data1 = as.numeric(data1), 
               data2 = as.numeric(data2))]

Replace the data columns with their means by nightNo group:

allData[, ':='(data1 = mean(data1), 
               data2 = mean(data2)),
        by = nightNo]

The result is:

             timestamp nightNo data1 data2
1: 2018-10-19 19:15:00       1     2     8
2: 2018-10-19 19:15:00       1     2     8
3: 2018-10-19 19:15:00       1     2     8
4: 2018-10-19 18:10:00       2     5    11
5: 2018-10-19 18:10:00       2     5    11
6: 2018-10-19 18:10:00       2     5    11
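
Note that the steps above keep all six rows and average over each whole night. If one row per night and minute is what is needed, the flooring can go directly into by. A minimal sketch under that assumption, using the same sample values as above; perMinute is just an illustrative name:

```r
library(data.table)
library(lubridate)

# Same sample values as in the reproducible data above
allData <- data.table(
  timestamp = c("2018-10-19 19:15:00", "2018-10-19 19:15:01", "2018-10-19 19:15:02",
                "2018-10-19 18:10:22", "2018-10-19 18:10:23", "2018-10-19 18:10:24"),
  nightNo = rep(1:2, each = 3),
  data1   = 1:6,
  data2   = 7:12
)

# Group by night and by the timestamp floored to the minute;
# .() builds new columns, so each group collapses to a single row
perMinute <- allData[, .(data1 = mean(data1), data2 = mean(data2)),
                     by = .(nightNo,
                            timestamp = floor_date(ymd_hms(timestamp), "minutes"))]
```

Because .() creates new columns instead of assigning into the existing integer ones, the as.numeric() conversion is not needed in this form.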