My for loop operating on each timestamp in high-frequency data is inefficient

Question

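I am using R to calculate the whole-lake temperature for every timestamp recorded during the open-water season.

I have loggers that record temperature at different depths every 10 minutes.

The data frame for each lake has over 100k entries, with over 10k distinct timestamps.

Below is how I solved this with a for loop. However, the code is extremely inefficient: it takes several hours per lake, depending on the lake's depth (deeper lakes have more loggers).

The example below resembles my data. The script runs quickly on the example, but the real data takes hours.

There should be a more efficient way to do this with one of the apply-family functions, but I don't know how to implement it.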

library(rLakeAnalyzer)
library(dplyr) # provides bind_rows(), used below

date <- c("2000-01-01 00:00:00","2000-01-01 00:00:00","2000-01-01 00:00:00",
          "2000-01-01 00:10:00","2000-01-01 00:10:00","2000-01-01 00:10:00",
          "2000-01-01 00:20:00","2000-01-01 00:20:00","2000-01-01 00:20:00")
depth <- c(1,2,3,1,2,3,1,2,3)
temp <- c(20,12,9,14,12,11,10,7,4)

dt <- data.frame(temp, depth, date) #example data frame; data.frame() keeps temp and depth numeric (cbind() would coerce every column to character)

dptd <- c(0,1,2,3) #example depth
dpta <- c(5000,2500,1250,625) #example area per depth

datelist <- levels(as.factor(dt$date)) #'for each date in the frame...'

ldf <- list() #list to store every row for the new data frame
for(i in 1:length(datelist)){
  print(i) #to check how fast it operates
  lek <- dt[grepl(datelist[i],dt$date),] #take every date in dt
  temp <- whole.lake.temperature(wtr=lek$temp,depths=lek$depth,bthA=dpta,bthD=dptd) #function 
  date <- datelist[i] 
  ldf[[i]] <- as.data.frame(cbind(temp,date)) #make a dataframe in list with 1 row and 2 col
}

ldf <- bind_rows(ldf) #convert list of data frames to a complete data frame
ldf$temp <- as.numeric(ldf$temp)
ldf$date <- as.POSIXct(ldf$date)

plot(ldf$date,ldf$temp) #voilà, a data frame with the whole-lake temperature at every timestamp

Answer 1

How about using data.table, grouping by date, and then applying the whole.lake.temperature function:

library(rLakeAnalyzer)
library(data.table)
date <- c("2000-01-01 00:00:00","2000-01-01 00:00:00","2000-01-01 00:00:00",
      "2000-01-01 00:10:00","2000-01-01 00:10:00","2000-01-01 00:10:00",
      "2000-01-01 00:20:00","2000-01-01 00:20:00","2000-01-01 00:20:00")
depth <- c(1,2,3,1,2,3,1,2,3)
temp <- c(20,12,9,14,12,11,10,7,4)

dt <- data.frame(temp, depth, date) #example data frame; data.frame() keeps temp and depth numeric (cbind() would coerce every column to character)

dptd <- c(0,1,2,3) #example depth
dpta <- c(5000,2500,1250,625) #example area per depth

results <- setDT(dt)[, .(temp = whole.lake.temperature(wtr = temp,
                                                       depths = depth,
                                                       bthA = dpta,
                                                       bthD = dptd)),
                     by = date]
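
The same group-then-apply idea can also be written in base R with split() and vapply(), which is the apply-family form the question mentions. The following is only a minimal sketch: `per_timestamp` is a hypothetical helper name, and a plain mean stands in for the lake calculation so that the block runs without rLakeAnalyzer installed.

```r
# Example data with numeric temp/depth columns (built with data.frame())
dt <- data.frame(
  temp  = c(20, 12, 9, 14, 12, 11, 10, 7, 4),
  depth = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  date  = rep(c("2000-01-01 00:00:00",
                "2000-01-01 00:10:00",
                "2000-01-01 00:20:00"), each = 3)
)

# Hypothetical helper: call fn(temp, depth) once per timestamp.
per_timestamp <- function(df, fn) {
  # split() groups the rows in a single pass instead of re-scanning per date
  pieces <- split(df[c("temp", "depth")], df$date)
  values <- vapply(pieces, function(p) fn(p$temp, p$depth), numeric(1))
  data.frame(date = as.POSIXct(names(values), tz = "UTC"),
             temp = unname(values))
}

# Stand-in summary (assumption): a plain mean, NOT the whole-lake temperature
demo <- per_timestamp(dt, function(wtr, depths) mean(wtr))
print(demo)
```

With rLakeAnalyzer loaded and `dpta`/`dptd` defined as above, the real calculation would presumably be passed in as `per_timestamp(dt, function(wtr, depths) whole.lake.temperature(wtr, depths, bthA = dpta, bthD = dptd))`.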

Without trying it on the full dataset, it is hard to tell whether this data.table approach speeds things up. Let me know if it helps.
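
One way to get a rough idea before running on a real lake is to time both subsetting patterns on synthetic data. The sketch below rests on assumptions: the sizes `n_times` and `n_depths` are made up, and `summarise_profile` is a hypothetical stand-in (a plain mean) used so that the timing reflects the cost of subsetting rather than the lake calculation itself.

```r
# Synthetic data: n_times timestamps at 10-minute steps, n_depths loggers each.
# Sizes are assumptions; raise n_times towards 10000 to mimic the real data.
n_times  <- 500
n_depths <- 10
stamps <- format(as.POSIXct("2000-01-01 00:00:00", tz = "UTC") +
                   600 * (seq_len(n_times) - 1), "%Y-%m-%d %H:%M:%S")
set.seed(1)
dt <- data.frame(
  temp  = runif(n_times * n_depths, 4, 20),
  depth = rep(seq_len(n_depths), times = n_times),
  date  = rep(stamps, each = n_depths)
)

# Stand-in for whole.lake.temperature() (assumption): a plain mean
summarise_profile <- function(wtr, depths) mean(wtr)

# Approach 1: the question's pattern, scanning every row for each timestamp
loop_time <- system.time({
  loop_res <- numeric(n_times)
  for (i in seq_along(stamps)) {
    lek <- dt[grepl(stamps[i], dt$date), ]
    loop_res[i] <- summarise_profile(lek$temp, lek$depth)
  }
})

# Approach 2: group the rows once with split(), then apply per group
split_time <- system.time({
  pieces <- split(dt[c("temp", "depth")], dt$date)
  split_res <- vapply(pieces, function(p) summarise_profile(p$temp, p$depth),
                      numeric(1))
})

# Elapsed seconds for each approach
print(c(loop = loop_time[["elapsed"]], split = split_time[["elapsed"]]))
```

The grepl() loop compares every row against every timestamp, so its cost grows with rows times timestamps, while a grouped approach partitions the rows once. Actual timings depend on the machine and on how expensive the per-timestamp function is.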
