Nested for-loops performing slowly with a list of tibbles in R

I have a list of 106 tibbles, which are time series, inside data_sensor. Each tibble has two columns: date and Temperature.

Separately, I have date_admin, a vector of 106 dates: the date at which I want each of my time series tibbles to end.

The code works fine, but with the nested for-loop it takes far too long, since each tibble has close to 10,000 rows on average.

library(tidyverse)
library(readxl) # for read_xls() and read_xlsx()

#List nesting all the dataframes of all the xls files
files <- dir("C:/User*inals", pattern = "\\.xls$", full.names = TRUE)
data_sensor <- lapply(files, read_xls)

##List nesting all the dataframes of all the xlsx files
filesx <- dir("C:/Us******ls", pattern = "\\.xlsx$", full.names = TRUE)
data_generic <- lapply(filesx, read_xlsx)

idxend = vector()
for (i in seq_along(data_sensor)) {
  for (j in seq_along(data_sensor[[i]][[1]])) {
    if (as.Date(data_sensor[[i]][[1]][[j]]) < as.Date(date_admin[i])) {
      data_sensor[[i]][[1]][[j]] = data_sensor[[i]][[1]][[j]] # keep the value
    } else { # Convert all the elements after the condition to NA's
      data_sensor[[i]][[1]][[j]] = NA
      data_sensor[[i]][[2]][[j]] = NA
    }
  }
}
# Drop all NA's
for (i in seq_along(data_sensor)) {
  data_sensor[[i]] = drop_na(data_sensor[[i]])
}

To clarify my list of tibbles and the vector:

> data_sensor[[1]][[1]][[1]]
[1] "2018-08-07 11:00:31 UTC"
> data_sensor[[1]][[2]][[1]]
[1] 6.3
> data_sensor[[2]][[1]][[1]]
[1] "2018-08-08 11:56:05 UTC" 
#data_sensor[[index of list]][[column of tibble(date,Temperature)]][[row of tibble]]
> date_admin
  [1] "2018-10-07 UTC" "2018-12-29 UTC" "2018-12-13 UTC" "2019-08-09 UTC" "2019-10-10 UTC"
  [6] "2019-04-26 UTC" "2018-11-21 UTC" "2018-08-23 UTC" "2019-07-08 UTC" "2019-11-19 UTC"
 [11] "2019-11-07 UTC" "2018-09-05 UTC" "2018-09-03 UTC" "2018-09-24 UTC" "2018-10-11 UTC"
 [16] "2018-09-25 UTC" "2019-03-29 UTC" "2018-08-20 UTC" "2018-09-17 UTC" "2019-03-30 UTC"
 [21] "2018-11-07 UTC" "2019-01-01 UTC" "2018-08-31 UTC" "2019-03-27 UTC" "2019-11-10 UTC"
 [26] "2019-04-04 UTC" "2019-10-18 UTC" "2018-09-06 UTC" "2018-09-23 UTC" "2018-09-22 UTC"
 [31] "2019-07-22 UTC" "2018-09-04 UTC" "2019-05-17 UTC" "2018-11-05 UTC" "2018-12-09 UTC"
 [36] "2018-09-03 UTC" "2019-05-21 UTC" "2019-02-22 UTC" "2018-08-30 UTC" "2019-06-04 UTC"
 [41] "2018-09-13 UTC" "2018-10-14 UTC" "2019-11-08 UTC" "2018-08-30 UTC" "2019-04-12 UTC"
 [46] "2018-09-24 UTC" "2018-08-22 UTC" "2018-08-30 UTC" "2018-09-07 UTC" "2018-11-11 UTC"
 [51] "2018-11-01 UTC" "2018-10-01 UTC" "2018-10-22 UTC" "2018-12-03 UTC" "2019-06-06 UTC"
 [56] "2018-09-09 UTC" "2018-09-10 UTC" "2018-09-24 UTC" "2018-10-11 UTC" "2018-11-30 UTC"
 [61] "2018-09-20 UTC" "2019-11-20 UTC" "2018-10-11 UTC" "2018-10-09 UTC" "2018-09-27 UTC"
 [66] "2019-11-11 UTC" "2018-10-04 UTC" "2018-09-14 UTC" "2019-04-27 UTC" "2018-09-04 UTC"
 [71] "2018-09-11 UTC" "2018-08-14 UTC" "2018-09-01 UTC" "2018-10-01 UTC" "2018-09-25 UTC"
 [76] "2018-09-28 UTC" "2018-09-29 UTC" "2018-10-11 UTC" "2019-03-26 UTC" "2018-10-26 UTC"
 [81] "2018-11-21 UTC" "2018-12-02 UTC" "2018-09-08 UTC" "2019-01-08 UTC" "2018-11-07 UTC"
 [86] "2019-02-05 UTC" "2019-01-21 UTC" "2018-09-11 UTC" "2018-12-17 UTC" "2019-01-15 UTC"
 [91] "2018-08-28 UTC" "2019-01-08 UTC" "2019-05-14 UTC" "2019-01-21 UTC" "2018-11-12 UTC"
 [96] "2018-10-26 UTC" "2019-12-26 UTC" "2020-01-03 UTC" "2020-01-06 UTC" "2020-02-26 UTC"
[101] "2020-02-14 UTC" "2020-01-27 UTC" "2020-01-21 UTC" "2020-03-16 UTC" "2020-02-26 UTC"
[106] "2019-12-31 UTC"

data_sensor[[1]]
                   date Temperature
1   2018-08-07 11:00:31         6.3
2   2018-08-07 11:10:31        11.4
3   2018-08-07 11:20:31        12.0
4   2018-08-07 11:30:31        13.7
5   2018-08-07 11:40:31        15.6
6   2018-08-07 11:50:31        13.6
7   2018-08-07 12:00:31        12.2
8   2018-08-07 12:10:31        11.2
9   2018-08-07 12:20:31        11.6
...............................
...............................
...............................
499 2018-08-10 22:00:31         9.7
500 2018-08-10 22:10:31         9.6
 [ reached 'max' / getOption("max.print") -- omitted 8592 rows ]

Cleaning the data with the nested for-loops takes several minutes. How can I improve the performance of my code?

Error while implementing the answer:

    > data_sensor = 
+   tibble(
+     file = paste("file",1:length(date_admin)),
+     date_admin = date_admin
+   ) %>% 
+   mutate(data_sensor = map(file, ~data_sensor))

> data_sensor
# A tibble: 106 x 3
   file    date_admin          data_sensor 
   <chr>   <dttm>              <list>      
 1 file 1  2018-10-07 00:00:00 <list [106]>
 2 file 2  2018-12-29 00:00:00 <list [106]>
 3 file 3  2018-12-13 00:00:00 <list [106]>

The class of my data_sensor was list before running this code; after running it, the class becomes:

[1] "tbl_df" "tbl" "data.frame"

The error occurs in this block:

> data_sensor = data_sensor %>% 
+   group_by(file) %>% 
+   group_modify(~f(.x))
 Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "list" 
> class(data_sensor)
[1] "tbl_df"     "tbl"        "data.frame"

If you convert the whole thing into a vectorized subsetting operation it should be an order of magnitude faster, e.g.:

for (i in seq_along(data_sensor)) {
  data_sensor[[i]] <- data_sensor[[i]][as.Date(data_sensor[[i]]$date) < as.Date(date_admin[i]), ]
}

For loops in R are generally somewhat slow; it is best to avoid nested loops and use vectorized operations wherever possible.
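The same idea can also be written without the explicit loop; a minimal sketch using purrr::map2 (which pairs the i-th tibble with the i-th date), untested for the same reason:

```r
library(purrr)

# Trim each tibble to the rows strictly before its matching date;
# map2() walks data_sensor and date_admin in parallel, so no index
# bookkeeping is needed.
data_sensor <- map2(
  data_sensor, date_admin,
  ~ .x[as.Date(.x$date) < as.Date(.y), ]
)
```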

P.S. I was not able to try this out, for lack of data.

Definitely do not loop!! There are more efficient methods for this kind of operation, and I will show you how. But first I need to generate some data. For that I created two small functions: rndDate draws a random start date between "1/1/2018" and "12/31/2020", and fDateSensor returns a tibble containing a time series at 10-minute intervals.

library(lubridate) # for ymd() and dminutes()

rndDate = function(start_date = ymd("20180101"), end_date = ymd("20201231")) {
  sample(seq(start_date, end_date, "days"), 1)
}

fDateSensor = function(n) tibble(
  date = rndDate() + 1:n * dminutes(10),
  Temperature = rnorm(n)
)

fDateSensor(5)

Output

# A tibble: 5 x 2
  date                Temperature
  <dttm>                    <dbl>
1 2019-09-27 00:10:00      -0.511
2 2019-09-27 00:20:00       0.531
3 2019-09-27 00:30:00       1.42 
4 2019-09-27 00:40:00       0.252
5 2019-09-27 00:50:00      -0.570

Now I will make a tibble containing inner tibbles. First with just two dates in date_admin.

nDateSensor = 10
set.seed(1234)
date_admin = c("2018-10-07", "2019-07-29")
data_sensor = 
  tibble(
    file = paste("file",1:length(date_admin)),
    date_admin = date_admin
  ) %>% 
  mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))
data_sensor

Output

# A tibble: 2 x 3
  file   date_admin data_sensor      
  <chr>  <chr>      <list>           
1 file 1 2018-10-07 <tibble [10 x 2]>
2 file 2 2019-07-29 <tibble [10 x 2]>

As you can see, I have simulated reading two files. Their contents are in the data_sensor variable, each as a 10 x 2 tibble.

data_sensor$data_sensor
[[1]]
# A tibble: 10 x 2
   date                Temperature
   <dttm>                    <dbl>
 1 2020-10-14 00:10:00      0.314 
 2 2020-10-14 00:20:00      0.359 
 3 2020-10-14 00:30:00     -0.730 
 4 2020-10-14 00:40:00      0.0357
 5 2020-10-14 00:50:00      0.113 
 6 2020-10-14 01:00:00      1.43  
 7 2020-10-14 01:10:00      0.983 
 8 2020-10-14 01:20:00     -0.622 
 9 2020-10-14 01:30:00     -0.732 
10 2020-10-14 01:40:00     -0.517 

[[2]]
# A tibble: 10 x 2
   date                Temperature
   <dttm>                    <dbl>
 1 2019-07-28 00:10:00     -0.776 
 2 2019-07-28 00:20:00      0.0645
 3 2019-07-28 00:30:00      0.959 
 4 2019-07-28 00:40:00     -0.110 
 5 2019-07-28 00:50:00     -0.511 
 6 2019-07-28 01:00:00     -0.911 
 7 2019-07-28 01:10:00     -0.837 
 8 2019-07-28 01:20:00      2.42  
 9 2019-07-28 01:30:00      0.134 
10 2019-07-28 01:40:00     -0.491 

Now for the key moment. We will build a function f that modifies our inner tibble according to your expectations.

f = function(data) {
  data$data_sensor[[1]] = data$data_sensor[[1]] %>% mutate(
    date = ifelse(date<data$date_admin, NA, date) %>% as_datetime(),
    Temperature = ifelse(date<data$date_admin, NA, Temperature)
  )  
  data %>% mutate(nNA = sum(is.na(data$data_sensor[[1]]$date)))
}

data_sensor = data_sensor %>% 
  group_by(file) %>% 
  group_modify(~f(.x))

data_sensor$data_sensor

Output

data_sensor$data_sensor
[[1]]
# A tibble: 10 x 2
   date                Temperature
   <dttm>                    <dbl>
 1 2020-10-14 00:10:00      0.314 
 2 2020-10-14 00:20:00      0.359 
 3 2020-10-14 00:30:00     -0.730 
 4 2020-10-14 00:40:00      0.0357
 5 2020-10-14 00:50:00      0.113 
 6 2020-10-14 01:00:00      1.43  
 7 2020-10-14 01:10:00      0.983 
 8 2020-10-14 01:20:00     -0.622 
 9 2020-10-14 01:30:00     -0.732 
10 2020-10-14 01:40:00     -0.517 

[[2]]
# A tibble: 10 x 2
   date   Temperature
   <dttm> <lgl>      
 1 NA     NA         
 2 NA     NA         
 3 NA     NA         
 4 NA     NA         
 5 NA     NA         
 6 NA     NA         
 7 NA     NA         
 8 NA     NA         
 9 NA     NA         
10 NA     NA         

As you can see, everything works.
Moreover, besides mutating data_sensor, our f function returns the number of NA observations:

# A tibble: 2 x 4
# Groups:   file [2]
  file   date_admin data_sensor         nNA
  <chr>  <chr>      <list>            <int>
1 file 1 2018-10-07 <tibble [10 x 2]>     0
2 file 2 2019-07-29 <tibble [10 x 2]>    10

So it is time to test it on somewhat larger data. Here I used your date_admin vector and drew 106 tibbles, each with 100,000 observations!

date_admin = c(
  "2018-10-07", "2018-12-29", "2018-12-13", "2019-08-09", "2019-10-10",
  "2019-04-26", "2018-11-21", "2018-08-23", "2019-07-08", "2019-11-19",
  "2019-11-07", "2018-09-05", "2018-09-03", "2018-09-24", "2018-10-11",
  "2018-09-25", "2019-03-29", "2018-08-20", "2018-09-17", "2019-03-30",
  "2018-11-07", "2019-01-01", "2018-08-31", "2019-03-27", "2019-11-10",
  "2019-04-04", "2019-10-18", "2018-09-06", "2018-09-23", "2018-09-22",
  "2019-07-22", "2018-09-04", "2019-05-17", "2018-11-05", "2018-12-09",
  "2018-09-03", "2019-05-21", "2019-02-22", "2018-08-30", "2019-06-04",
  "2018-09-13", "2018-10-14", "2019-11-08", "2018-08-30", "2019-04-12",
  "2018-09-24", "2018-08-22", "2018-08-30", "2018-09-07", "2018-11-11",
  "2018-11-01", "2018-10-01", "2018-10-22", "2018-12-03", "2019-06-06",
  "2018-09-09", "2018-09-10", "2018-09-24", "2018-10-11", "2018-11-30",
  "2018-09-20", "2019-11-20", "2018-10-11", "2018-10-09", "2018-09-27",
  "2019-11-11", "2018-10-04", "2018-09-14", "2019-04-27", "2018-09-04",
  "2018-09-11", "2018-08-14", "2018-09-01", "2018-10-01", "2018-09-25",
  "2018-09-28", "2018-09-29", "2018-10-11", "2019-03-26", "2018-10-26",
  "2018-11-21", "2018-12-02", "2018-09-08", "2019-01-08", "2018-11-07",
  "2019-02-05", "2019-01-21", "2018-09-11", "2018-12-17", "2019-01-15",
  "2018-08-28", "2019-01-08", "2019-05-14", "2019-01-21", "2018-11-12",
  "2018-10-26", "2019-12-26", "2020-01-03", "2020-01-06", "2020-02-26",
  "2020-02-14", "2020-01-27", "2020-01-21", "2020-03-16", "2020-02-26",
  "2019-12-31")

nDateSensor = 100000
set.seed(1234)

data_sensor = 
  tibble(
    file = paste("file",1:length(date_admin)),
    date_admin = date_admin
  ) %>% 
  mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))

Output

 data_sensor
# A tibble: 106 x 3
   file    date_admin data_sensor           
   <chr>   <chr>      <list>                
 1 file 1  2018-10-07 <tibble [100,000 x 2]>
 2 file 2  2018-12-29 <tibble [100,000 x 2]>
 3 file 3  2018-12-13 <tibble [100,000 x 2]>
 4 file 4  2019-08-09 <tibble [100,000 x 2]>
 5 file 5  2019-10-10 <tibble [100,000 x 2]>
 6 file 6  2019-04-26 <tibble [100,000 x 2]>
 7 file 7  2018-11-21 <tibble [100,000 x 2]>
 8 file 8  2018-08-23 <tibble [100,000 x 2]>
 9 file 9  2019-07-08 <tibble [100,000 x 2]>
10 file 10 2019-11-19 <tibble [100,000 x 2]>
# ... with 96 more rows

Time for the mutation. We will measure how long it takes while we are at it.

start_time = Sys.time()
data_sensor = data_sensor %>% 
  group_by(file) %>% 
  group_modify(~f(.x))
Sys.time()-start_time

For me it took 2.3 seconds. I do not know whether that is the kind of time you expected, but it looks like a good result.
Let's see what our data_sensor looks like now.

# A tibble: 106 x 4
# Groups:   file [106]
   file     date_admin data_sensor              nNA
   <chr>    <chr>      <list>                 <int>
 1 file 1   2018-10-07 <tibble [100,000 x 2]>     0
 2 file 10  2019-11-19 <tibble [100,000 x 2]> 19001
 3 file 100 2020-02-26 <tibble [100,000 x 2]> 95897
 4 file 101 2020-02-14 <tibble [100,000 x 2]>  7769
 5 file 102 2020-01-27 <tibble [100,000 x 2]> 99497
 6 file 103 2020-01-21 <tibble [100,000 x 2]>     0
 7 file 104 2020-03-16 <tibble [100,000 x 2]> 50969
 8 file 105 2020-02-26 <tibble [100,000 x 2]>     0
 9 file 106 2019-12-31 <tibble [100,000 x 2]> 13673
10 file 11  2019-11-07 <tibble [100,000 x 2]> 16697
# ... with 96 more rows

As you can see, part of the data has been changed to NA, so everything works.
All you have to do now is read your xls file names into the file column, then load the data into the data_sensor variable using group_by(file) and group_modify. Good luck!
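Putting that together with real files, the scaffolding might look as follows. This is only a sketch: `C:/data` is a placeholder path, and it assumes date_admin is ordered the same way as the files on disk, with f defined as above.

```r
library(tidyverse)
library(readxl)

# One row per file; the i-th admission date must belong to the i-th file.
files <- dir("C:/data", pattern = "\\.xls$", full.names = TRUE)

data_sensor <-
  tibble(file = files, date_admin = date_admin) %>%
  mutate(data_sensor = map(file, read_xls)) %>%  # read each file into a list-column
  group_by(file) %>%
  group_modify(~ f(.x))
```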