Nested for loops performing slowly with a list of tibbles in R
I have a list of 106 tibbles in data_sensor, each a time series. Every tibble has two columns: date and Temperature.
Separately, I have date_admin, a vector of 106 dates at which I want each corresponding time series to end.
The code works, but with nested for loops it takes far too long, since each tibble has around 10,000 rows on average.
library(tidyverse)
library(readxl) # for read_xls() / read_xlsx()

# List nesting the data frames of all the xls files
files <- dir("C:/User*inals", pattern = "\\.xls$", full.names = TRUE)
data_sensor <- lapply(files, read_xls)

# List nesting the data frames of all the xlsx files
filesx <- dir("C:/Us******ls", pattern = "\\.xlsx$", full.names = TRUE)
data_generic <- lapply(filesx, read_xlsx)

for (i in seq_along(data_sensor)) {
  for (j in seq_along(data_sensor[[i]][[1]])) {
    if (as.Date(data_sensor[[i]][[1]][[j]]) < as.Date(date_admin[i])) {
      data_sensor[[i]][[1]][[j]] <- data_sensor[[i]][[1]][[j]] # keep the reading
    } else { # convert all elements from the cut-off date onwards to NAs
      data_sensor[[i]][[1]][[j]] <- NA
      data_sensor[[i]][[2]][[j]] <- NA
    }
  }
}

# Drop all NAs
for (i in seq_along(data_sensor)) {
  data_sensor[[i]] <- drop_na(data_sensor[[i]])
}
To clarify my list of tibbles and the vector:
> data_sensor[[1]][[1]][[1]]
[1] "2018-08-07 11:00:31 UTC"
> data_sensor[[1]][[2]][[1]]
[1] 6.3
> data_sensor[[2]][[1]][[1]]
[1] "2018-08-08 11:56:05 UTC"
#data_sensor[[index of list]][[column of tibble(date,Temperature)]][[row of tibble]]
> date_admin
[1] "2018-10-07 UTC" "2018-12-29 UTC" "2018-12-13 UTC" "2019-08-09 UTC" "2019-10-10 UTC"
[6] "2019-04-26 UTC" "2018-11-21 UTC" "2018-08-23 UTC" "2019-07-08 UTC" "2019-11-19 UTC"
[11] "2019-11-07 UTC" "2018-09-05 UTC" "2018-09-03 UTC" "2018-09-24 UTC" "2018-10-11 UTC"
[16] "2018-09-25 UTC" "2019-03-29 UTC" "2018-08-20 UTC" "2018-09-17 UTC" "2019-03-30 UTC"
[21] "2018-11-07 UTC" "2019-01-01 UTC" "2018-08-31 UTC" "2019-03-27 UTC" "2019-11-10 UTC"
[26] "2019-04-04 UTC" "2019-10-18 UTC" "2018-09-06 UTC" "2018-09-23 UTC" "2018-09-22 UTC"
[31] "2019-07-22 UTC" "2018-09-04 UTC" "2019-05-17 UTC" "2018-11-05 UTC" "2018-12-09 UTC"
[36] "2018-09-03 UTC" "2019-05-21 UTC" "2019-02-22 UTC" "2018-08-30 UTC" "2019-06-04 UTC"
[41] "2018-09-13 UTC" "2018-10-14 UTC" "2019-11-08 UTC" "2018-08-30 UTC" "2019-04-12 UTC"
[46] "2018-09-24 UTC" "2018-08-22 UTC" "2018-08-30 UTC" "2018-09-07 UTC" "2018-11-11 UTC"
[51] "2018-11-01 UTC" "2018-10-01 UTC" "2018-10-22 UTC" "2018-12-03 UTC" "2019-06-06 UTC"
[56] "2018-09-09 UTC" "2018-09-10 UTC" "2018-09-24 UTC" "2018-10-11 UTC" "2018-11-30 UTC"
[61] "2018-09-20 UTC" "2019-11-20 UTC" "2018-10-11 UTC" "2018-10-09 UTC" "2018-09-27 UTC"
[66] "2019-11-11 UTC" "2018-10-04 UTC" "2018-09-14 UTC" "2019-04-27 UTC" "2018-09-04 UTC"
[71] "2018-09-11 UTC" "2018-08-14 UTC" "2018-09-01 UTC" "2018-10-01 UTC" "2018-09-25 UTC"
[76] "2018-09-28 UTC" "2018-09-29 UTC" "2018-10-11 UTC" "2019-03-26 UTC" "2018-10-26 UTC"
[81] "2018-11-21 UTC" "2018-12-02 UTC" "2018-09-08 UTC" "2019-01-08 UTC" "2018-11-07 UTC"
[86] "2019-02-05 UTC" "2019-01-21 UTC" "2018-09-11 UTC" "2018-12-17 UTC" "2019-01-15 UTC"
[91] "2018-08-28 UTC" "2019-01-08 UTC" "2019-05-14 UTC" "2019-01-21 UTC" "2018-11-12 UTC"
[96] "2018-10-26 UTC" "2019-12-26 UTC" "2020-01-03 UTC" "2020-01-06 UTC" "2020-02-26 UTC"
[101] "2020-02-14 UTC" "2020-01-27 UTC" "2020-01-21 UTC" "2020-03-16 UTC" "2020-02-26 UTC"
[106] "2019-12-31 UTC"
data_sensor[[1]]
date Temperature
1 2018-08-07 11:00:31 6.3
2 2018-08-07 11:10:31 11.4
3 2018-08-07 11:20:31 12.0
4 2018-08-07 11:30:31 13.7
5 2018-08-07 11:40:31 15.6
6 2018-08-07 11:50:31 13.6
7 2018-08-07 12:00:31 12.2
8 2018-08-07 12:10:31 11.2
9 2018-08-07 12:20:31 11.6
...............................
...............................
...............................
499 2018-08-10 22:00:31 9.7
500 2018-08-10 22:10:31 9.6
[ reached 'max' / getOption("max.print") -- omitted 8592 rows ]
Cleaning the data via the nested for loops takes several minutes. How can I improve the performance of my code?
Errors when implementing the answer:
> data_sensor =
+ tibble(
+ file = paste("file",1:length(date_admin)),
+ date_admin = date_admin
+ ) %>%
+ mutate(data_sensor = map(file, ~data_sensor))
> data_sensor
# A tibble: 106 x 3
file date_admin data_sensor
<chr> <dttm> <list>
1 file 1 2018-10-07 00:00:00 <list [106]>
2 file 2 2018-12-29 00:00:00 <list [106]>
3 file 3 2018-12-13 00:00:00 <list [106]>
The class of my data_sensor was list before running the code, and afterwards it becomes:
[1] "tbl_df"     "tbl"        "data.frame"
The error occurs in this block:
> data_sensor = data_sensor %>%
+ group_by(file) %>%
+ group_modify(~f(.x))
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
> class(data_sensor)
[1] "tbl_df" "tbl" "data.frame"
If you do the whole thing as one vectorized subsetting operation per tibble, it should be an order of magnitude faster, e.g.:
for (i in seq_along(data_sensor)) {
  data_sensor[[i]] <- data_sensor[[i]][as.Date(data_sensor[[i]]$date) < as.Date(date_admin[i]), ]
}
For loops are generally somewhat slow; it is best to avoid nesting them and to use vectorized operations wherever possible.
PS: I could not try this myself for lack of data.
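As a self-contained illustration of this vectorized approach, here is a sketch with two small made-up data frames standing in for the real xls data (which is not available here):

```r
# Toy stand-ins for the real data: two data frames and two cut-off dates.
data_sensor <- list(
  data.frame(date = as.POSIXct("2018-08-07 11:00:31", tz = "UTC") + (0:4) * 600,
             Temperature = c(6.3, 11.4, 12.0, 13.7, 15.6)),
  data.frame(date = as.POSIXct("2018-08-08 11:56:05", tz = "UTC") + (0:4) * 600,
             Temperature = c(9.1, 9.3, 9.0, 8.8, 8.7))
)
date_admin <- as.Date(c("2018-08-07", "2018-12-29"))

# One vectorized subset per data frame: keep only the rows whose date falls
# strictly before the matching cut-off date.
data_sensor <- Map(function(df, cutoff) df[as.Date(df$date) < cutoff, , drop = FALSE],
                   data_sensor, date_admin)
```

The whole inner loop collapses into a single logical comparison per tibble; in this toy run the first series loses all five rows (its readings fall on the cut-off day itself) and the second keeps all five.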
Definitely do not loop! There are much more efficient ways to do this kind of operation, and I will show you how. But first I need to generate some data, for which I wrote two small functions: rndDate draws a random start date between 2018-01-01 and 2020-12-31, while fDateSensor returns a tibble with a time series at 10-minute intervals.
library(tidyverse)
library(lubridate) # for ymd() and dminutes()

rndDate = function(start_date = ymd("20180101"), end_date = ymd("20201231")) {
  sample(seq(start_date, end_date, "days"), 1)
}

fDateSensor = function(n) tibble(
  date = rndDate() + 1:n * dminutes(10),
  Temperature = rnorm(n)
)
fDateSensor(5)
Output
# A tibble: 5 x 2
date Temperature
<dttm> <dbl>
1 2019-09-27 00:10:00 -0.511
2 2019-09-27 00:20:00 0.531
3 2019-09-27 00:30:00 1.42
4 2019-09-27 00:40:00 0.252
5 2019-09-27 00:50:00 -0.570
Now I will build a tibble with inner tibbles, starting with just two dates in date_admin.
nDateSensor = 10
set.seed(1234)
date_admin = c("2018-10-07", "2019-07-29")
data_sensor =
  tibble(
    file = paste("file", 1:length(date_admin)),
    date_admin = date_admin
  ) %>%
  mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))
data_sensor
Output
# A tibble: 2 x 3
file date_admin data_sensor
<chr> <chr> <list>
1 file 1 2018-10-07 <tibble [10 x 2]>
2 file 2 2019-07-29 <tibble [10 x 2]>
As you can see, I have simulated reading two files. Their contents sit in the data_sensor variable, each a 10x2 tibble.
data_sensor$data_sensor
[[1]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2020-10-14 00:10:00 0.314
2 2020-10-14 00:20:00 0.359
3 2020-10-14 00:30:00 -0.730
4 2020-10-14 00:40:00 0.0357
5 2020-10-14 00:50:00 0.113
6 2020-10-14 01:00:00 1.43
7 2020-10-14 01:10:00 0.983
8 2020-10-14 01:20:00 -0.622
9 2020-10-14 01:30:00 -0.732
10 2020-10-14 01:40:00 -0.517
[[2]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2019-07-28 00:10:00 -0.776
2 2019-07-28 00:20:00 0.0645
3 2019-07-28 00:30:00 0.959
4 2019-07-28 00:40:00 -0.110
5 2019-07-28 00:50:00 -0.511
6 2019-07-28 01:00:00 -0.911
7 2019-07-28 01:10:00 -0.837
8 2019-07-28 01:20:00 2.42
9 2019-07-28 01:30:00 0.134
10 2019-07-28 01:40:00 -0.491
Now for the crucial step: we build a function f that modifies our inner tibble according to your requirements.
f = function(data) {
  data$data_sensor[[1]] = data$data_sensor[[1]] %>% mutate(
    date = ifelse(date < data$date_admin, NA, date) %>% as_datetime(),
    Temperature = ifelse(date < data$date_admin, NA, Temperature)
  )
  data %>% mutate(nNA = sum(is.na(data$data_sensor[[1]]$date)))
}
data_sensor = data_sensor %>%
  group_by(file) %>%
  group_modify(~f(.x))
data_sensor$data_sensor
Output
[[1]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2020-10-14 00:10:00 0.314
2 2020-10-14 00:20:00 0.359
3 2020-10-14 00:30:00 -0.730
4 2020-10-14 00:40:00 0.0357
5 2020-10-14 00:50:00 0.113
6 2020-10-14 01:00:00 1.43
7 2020-10-14 01:10:00 0.983
8 2020-10-14 01:20:00 -0.622
9 2020-10-14 01:30:00 -0.732
10 2020-10-14 01:40:00 -0.517
[[2]]
# A tibble: 10 x 2
date Temperature
<dttm> <lgl>
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 NA NA
8 NA NA
9 NA NA
10 NA NA
As you can see, everything works. Moreover, besides mutating data_sensor, our f function also returns the number of NA observations:
# A tibble: 2 x 4
# Groups: file [2]
file date_admin data_sensor nNA
<chr> <chr> <list> <int>
1 file 1 2018-10-07 <tibble [10 x 2]> 0
2 file 2 2019-07-29 <tibble [10 x 2]> 10
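For intuition, the masking step inside f can be sketched in base R on toy data (mirroring the answer's direction of the comparison, which blanks out rows before the admin date):

```r
# Ten readings at 10-minute intervals on 2019-07-28, one day before the
# admin date, so every row should be blanked out.
ts <- data.frame(date = as.POSIXct("2019-07-28 00:10:00", tz = "UTC") + (0:9) * 600,
                 Temperature = rnorm(10))
admin <- as.POSIXct("2019-07-29", tz = "UTC")

before <- ts$date < admin   # rows the function turns into NAs
ts$Temperature[before] <- NA
ts$date[before] <- NA
nNA <- sum(is.na(ts$date))  # 10, matching the file 2 row above
```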
So it is time to test this on somewhat larger data. Here I use your date_admin vector and draw 106 tibbles, each containing 100,000 observations!
date_admin = c(
"2018-10-07", "2018-12-29", "2018-12-13", "2019-08-09", "2019-10-10",
"2019-04-26", "2018-11-21", "2018-08-23", "2019-07-08", "2019-11-19",
"2019-11-07", "2018-09-05", "2018-09-03", "2018-09-24", "2018-10-11",
"2018-09-25", "2019-03-29", "2018-08-20", "2018-09-17", "2019-03-30",
"2018-11-07", "2019-01-01", "2018-08-31", "2019-03-27", "2019-11-10",
"2019-04-04", "2019-10-18", "2018-09-06", "2018-09-23", "2018-09-22",
"2019-07-22", "2018-09-04", "2019-05-17", "2018-11-05", "2018-12-09",
"2018-09-03", "2019-05-21", "2019-02-22", "2018-08-30", "2019-06-04",
"2018-09-13", "2018-10-14", "2019-11-08", "2018-08-30", "2019-04-12",
"2018-09-24", "2018-08-22", "2018-08-30", "2018-09-07", "2018-11-11",
"2018-11-01", "2018-10-01", "2018-10-22", "2018-12-03", "2019-06-06",
"2018-09-09", "2018-09-10", "2018-09-24", "2018-10-11", "2018-11-30",
"2018-09-20", "2019-11-20", "2018-10-11", "2018-10-09", "2018-09-27",
"2019-11-11", "2018-10-04", "2018-09-14", "2019-04-27", "2018-09-04",
"2018-09-11", "2018-08-14", "2018-09-01", "2018-10-01", "2018-09-25",
"2018-09-28", "2018-09-29", "2018-10-11", "2019-03-26", "2018-10-26",
"2018-11-21", "2018-12-02", "2018-09-08", "2019-01-08", "2018-11-07",
"2019-02-05", "2019-01-21", "2018-09-11", "2018-12-17", "2019-01-15",
"2018-08-28", "2019-01-08", "2019-05-14", "2019-01-21", "2018-11-12",
"2018-10-26", "2019-12-26", "2020-01-03", "2020-01-06", "2020-02-26",
"2020-02-14", "2020-01-27", "2020-01-21", "2020-03-16", "2020-02-26",
"2019-12-31")
nDateSensor = 100000
set.seed(1234)
data_sensor =
  tibble(
    file = paste("file", 1:length(date_admin)),
    date_admin = date_admin
  ) %>%
  mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))
Output
data_sensor
# A tibble: 106 x 3
file date_admin data_sensor
<chr> <chr> <list>
1 file 1 2018-10-07 <tibble [100,000 x 2]>
2 file 2 2018-12-29 <tibble [100,000 x 2]>
3 file 3 2018-12-13 <tibble [100,000 x 2]>
4 file 4 2019-08-09 <tibble [100,000 x 2]>
5 file 5 2019-10-10 <tibble [100,000 x 2]>
6 file 6 2019-04-26 <tibble [100,000 x 2]>
7 file 7 2018-11-21 <tibble [100,000 x 2]>
8 file 8 2018-08-23 <tibble [100,000 x 2]>
9 file 9 2019-07-08 <tibble [100,000 x 2]>
10 file 10 2019-11-19 <tibble [100,000 x 2]>
# ... with 96 more rows
Time for the mutation. We will measure how long it takes while we are at it.
start_time = Sys.time()
data_sensor = data_sensor %>%
  group_by(file) %>%
  group_modify(~f(.x))
Sys.time() - start_time
It took 2.3 seconds on my machine. I do not know what timing you expected, but that seems like a good result.
Let's see what our data_sensor looks like now.
# A tibble: 106 x 4
# Groups: file [106]
file date_admin data_sensor nNA
<chr> <chr> <list> <int>
1 file 1 2018-10-07 <tibble [100,000 x 2]> 0
2 file 10 2019-11-19 <tibble [100,000 x 2]> 19001
3 file 100 2020-02-26 <tibble [100,000 x 2]> 95897
4 file 101 2020-02-14 <tibble [100,000 x 2]> 7769
5 file 102 2020-01-27 <tibble [100,000 x 2]> 99497
6 file 103 2020-01-21 <tibble [100,000 x 2]> 0
7 file 104 2020-03-16 <tibble [100,000 x 2]> 50969
8 file 105 2020-02-26 <tibble [100,000 x 2]> 0
9 file 106 2019-12-31 <tibble [100,000 x 2]> 13673
10 file 11 2019-11-07 <tibble [100,000 x 2]> 16697
# ... with 96 more rows
As you can see, part of the data has been changed to NA, so everything works.
All you have to do is read the xls file names into data_sensor, then use group_by(file) and group_modify to load and transform the data in the data_sensor variable. Good luck!
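A minimal sketch of that final workflow, using a stub in place of readxl::read_xls (read_one is a hypothetical helper, since no real .xls files are available here):

```r
# read_one() is a hypothetical stand-in for readxl::read_xls; in practice,
# use dir(..., pattern = "\\.xls$", full.names = TRUE) and the real reader.
read_one <- function(path) {
  data.frame(date = as.POSIXct("2018-08-07 11:00:31", tz = "UTC") + (0:2) * 600,
             Temperature = c(6.3, 11.4, 12.0))
}

files <- c("a.xls", "b.xls")  # placeholder file names
nested <- data.frame(file = files, stringsAsFactors = FALSE)
nested$data_sensor <- lapply(nested$file, read_one)  # one data frame per file
```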