R data.table:优化(不同)组的行操作速度
R data.table: optimize speed of row operations by (different) groups
所以,我有这个庞大的数据集(可能有数千个条目),我的代码以 data.table 的长格式接收这些数据集,如下所示:
#sample DT
sample_size = 8
DT0 <- data.table(
DATE = seq.Date(from = as.Date("2020/3/01"), by = "day", length.out = sample_size),
BANANA = seq(30, by=0.060, length.out = sample_size),
ORANGE = seq(5, by=0.035, length.out = sample_size),
APPLE = seq(12, by=0.6, length.out = sample_size),
LEMON = seq(10, by=0.01, length.out = sample_size),
GRAPE = seq(0.5, by=0.13, length.out = sample_size)
)
DT <- melt(DT0,
id.vars = c("DATE"),
variable.name = "FRUIT",
value.name = "PRICE")
setkeyv(DT, cols=c("FRUIT", "DATE"))
DT 正是我的数据现在的样子:
> DT
DATE FRUIT PRICE
1: 2020-03-01 BANANA 30.000
2: 2020-03-02 BANANA 30.060
3: 2020-03-03 BANANA 30.120
4: 2020-03-04 BANANA 30.180
5: 2020-03-05 BANANA 30.240
6: 2020-03-06 BANANA 30.300
7: 2020-03-07 BANANA 30.360
8: 2020-03-08 BANANA 30.420
9: 2020-03-01 ORANGE 5.000
10: 2020-03-02 ORANGE 5.035
11: 2020-03-03 ORANGE 5.070
12: 2020-03-04 ORANGE 5.105
13: 2020-03-05 ORANGE 5.140
14: 2020-03-06 ORANGE 5.175
15: 2020-03-07 ORANGE 5.210
16: 2020-03-08 ORANGE 5.245
17: 2020-03-01 APPLE 12.000
18: 2020-03-02 APPLE 12.600
19: 2020-03-03 APPLE 13.200
20: 2020-03-04 APPLE 13.800
21: 2020-03-05 APPLE 14.400
22: 2020-03-06 APPLE 15.000
23: 2020-03-07 APPLE 15.600
24: 2020-03-08 APPLE 16.200
25: 2020-03-01 LEMON 10.000
26: 2020-03-02 LEMON 10.010
27: 2020-03-03 LEMON 10.020
28: 2020-03-04 LEMON 10.030
29: 2020-03-05 LEMON 10.040
30: 2020-03-06 LEMON 10.050
31: 2020-03-07 LEMON 10.060
32: 2020-03-08 LEMON 10.070
33: 2020-03-01 GRAPE 0.500
34: 2020-03-02 GRAPE 0.630
35: 2020-03-03 GRAPE 0.760
36: 2020-03-04 GRAPE 0.890
37: 2020-03-05 GRAPE 1.020
38: 2020-03-06 GRAPE 1.150
39: 2020-03-07 GRAPE 1.280
40: 2020-03-08 GRAPE 1.410
DATE FRUIT PRICE
现在,假设我需要在一个新列(“RESULT”)中计算 FRUIT 中每个项目的价格与一个常数的价格之间的差值(或任何其他更复杂的运算)每天指定水果(例如:葡萄)。
只是为了帮助您将想法形象化,RESULT 列将类似于这些操作的结果,对于每一天:
result := PRICE BANANA - PRICE GRAPE
result := PRICE ORANGE - PRICE GRAPE
result := PRICE APPLE - PRICE GRAPE
result := PRICE BANANA - PRICE GRAPE
result := PRICE LEMON - PRICE GRAPE
result := PRICE GRAPE - PRICE GRAPE
经过数小时的反复试验(但仍然不太了解我在做什么),这是我设法做到的:
#my try:
chosen_fruit <- "GRAPE"
setkey(DT, DATE)
DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE]
> DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-01 ORANGE 5.000 4.500
3: 2020-03-01 APPLE 12.000 11.500
4: 2020-03-01 LEMON 10.000 9.500
5: 2020-03-01 GRAPE 0.500 0.000
6: 2020-03-02 BANANA 30.060 29.430
7: 2020-03-02 ORANGE 5.035 4.405
8: 2020-03-02 APPLE 12.600 11.970
9: 2020-03-02 LEMON 10.010 9.380
10: 2020-03-02 GRAPE 0.630 0.000
11: 2020-03-03 BANANA 30.120 29.360
12: 2020-03-03 ORANGE 5.070 4.310
13: 2020-03-03 APPLE 13.200 12.440
14: 2020-03-03 LEMON 10.020 9.260
15: 2020-03-03 GRAPE 0.760 0.000
16: 2020-03-04 BANANA 30.180 29.290
17: 2020-03-04 ORANGE 5.105 4.215
18: 2020-03-04 APPLE 13.800 12.910
19: 2020-03-04 LEMON 10.030 9.140
20: 2020-03-04 GRAPE 0.890 0.000
21: 2020-03-05 BANANA 30.240 29.220
22: 2020-03-05 ORANGE 5.140 4.120
23: 2020-03-05 APPLE 14.400 13.380
24: 2020-03-05 LEMON 10.040 9.020
25: 2020-03-05 GRAPE 1.020 0.000
26: 2020-03-06 BANANA 30.300 29.150
27: 2020-03-06 ORANGE 5.175 4.025
28: 2020-03-06 APPLE 15.000 13.850
29: 2020-03-06 LEMON 10.050 8.900
30: 2020-03-06 GRAPE 1.150 0.000
31: 2020-03-07 BANANA 30.360 29.080
32: 2020-03-07 ORANGE 5.210 3.930
33: 2020-03-07 APPLE 15.600 14.320
34: 2020-03-07 LEMON 10.060 8.780
35: 2020-03-07 GRAPE 1.280 0.000
36: 2020-03-08 BANANA 30.420 29.010
37: 2020-03-08 ORANGE 5.245 3.835
38: 2020-03-08 APPLE 16.200 14.790
39: 2020-03-08 LEMON 10.070 8.660
40: 2020-03-08 GRAPE 1.410 0.000
DATE FRUIT PRICE RESULTS
我最终想要的输出:
setkey(DT, FRUIT)
> DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-02 BANANA 30.060 29.430
3: 2020-03-03 BANANA 30.120 29.360
4: 2020-03-04 BANANA 30.180 29.290
5: 2020-03-05 BANANA 30.240 29.220
6: 2020-03-06 BANANA 30.300 29.150
7: 2020-03-07 BANANA 30.360 29.080
8: 2020-03-08 BANANA 30.420 29.010
9: 2020-03-01 ORANGE 5.000 4.500
10: 2020-03-02 ORANGE 5.035 4.405
11: 2020-03-03 ORANGE 5.070 4.310
12: 2020-03-04 ORANGE 5.105 4.215
13: 2020-03-05 ORANGE 5.140 4.120
14: 2020-03-06 ORANGE 5.175 4.025
15: 2020-03-07 ORANGE 5.210 3.930
16: 2020-03-08 ORANGE 5.245 3.835
17: 2020-03-01 APPLE 12.000 11.500
18: 2020-03-02 APPLE 12.600 11.970
19: 2020-03-03 APPLE 13.200 12.440
20: 2020-03-04 APPLE 13.800 12.910
21: 2020-03-05 APPLE 14.400 13.380
22: 2020-03-06 APPLE 15.000 13.850
23: 2020-03-07 APPLE 15.600 14.320
24: 2020-03-08 APPLE 16.200 14.790
25: 2020-03-01 LEMON 10.000 9.500
26: 2020-03-02 LEMON 10.010 9.380
27: 2020-03-03 LEMON 10.020 9.260
28: 2020-03-04 LEMON 10.030 9.140
29: 2020-03-05 LEMON 10.040 9.020
30: 2020-03-06 LEMON 10.050 8.900
31: 2020-03-07 LEMON 10.060 8.780
32: 2020-03-08 LEMON 10.070 8.660
33: 2020-03-01 GRAPE 0.500 0.000
34: 2020-03-02 GRAPE 0.630 0.000
35: 2020-03-03 GRAPE 0.760 0.000
36: 2020-03-04 GRAPE 0.890 0.000
37: 2020-03-05 GRAPE 1.020 0.000
38: 2020-03-06 GRAPE 1.150 0.000
39: 2020-03-07 GRAPE 1.280 0.000
40: 2020-03-08 GRAPE 1.410 0.000
DATE FRUIT PRICE RESULTS
如您所见,我以某种方式达到了目标。不过好像不太理想。
我现在正在学习 R(和一般编程),所以我对所有这些都非常陌生,尤其是在尽可能优化代码方面(考虑此处的内存限制)。
关于如何改进 speed/performance 操作或更好的方法有什么建议吗?
非常感谢您!
在 OP 的代码中,我们不需要 ==
一旦我们 set
键,即第一个 setkey
就足够了,并通过 [ 加入 on
=37=] 同时做 PRICE
和 i.PRICE
的减法
setkeyv(DT, cols=c("FRUIT", "DATE"))
DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)]
或者另一种选择是按 'DATE' 分组,从相应的 PRICE 中减去 'PRICE',其中 'FRUIT' 是 'GRAPE'
library(data.table)
DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE]
-输出
DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-02 BANANA 30.060 29.430
3: 2020-03-03 BANANA 30.120 29.360
4: 2020-03-04 BANANA 30.180 29.290
5: 2020-03-05 BANANA 30.240 29.220
6: 2020-03-06 BANANA 30.300 29.150
7: 2020-03-07 BANANA 30.360 29.080
8: 2020-03-08 BANANA 30.420 29.010
9: 2020-03-01 ORANGE 5.000 4.500
10: 2020-03-02 ORANGE 5.035 4.405
11: 2020-03-03 ORANGE 5.070 4.310
12: 2020-03-04 ORANGE 5.105 4.215
13: 2020-03-05 ORANGE 5.140 4.120
14: 2020-03-06 ORANGE 5.175 4.025
15: 2020-03-07 ORANGE 5.210 3.930
16: 2020-03-08 ORANGE 5.245 3.835
17: 2020-03-01 APPLE 12.000 11.500
18: 2020-03-02 APPLE 12.600 11.970
19: 2020-03-03 APPLE 13.200 12.440
20: 2020-03-04 APPLE 13.800 12.910
21: 2020-03-05 APPLE 14.400 13.380
22: 2020-03-06 APPLE 15.000 13.850
23: 2020-03-07 APPLE 15.600 14.320
24: 2020-03-08 APPLE 16.200 14.790
25: 2020-03-01 LEMON 10.000 9.500
26: 2020-03-02 LEMON 10.010 9.380
27: 2020-03-03 LEMON 10.020 9.260
28: 2020-03-04 LEMON 10.030 9.140
29: 2020-03-05 LEMON 10.040 9.020
30: 2020-03-06 LEMON 10.050 8.900
31: 2020-03-07 LEMON 10.060 8.780
32: 2020-03-08 LEMON 10.070 8.660
33: 2020-03-01 GRAPE 0.500 0.000
34: 2020-03-02 GRAPE 0.630 0.000
35: 2020-03-03 GRAPE 0.760 0.000
36: 2020-03-04 GRAPE 0.890 0.000
37: 2020-03-05 GRAPE 1.020 0.000
38: 2020-03-06 GRAPE 1.150 0.000
39: 2020-03-07 GRAPE 1.280 0.000
40: 2020-03-08 GRAPE 1.410 0.000
或者另一个选择是 dcast
到 'wide' 格式然后做减法
dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
nm1 <- names(dt_wide)[-1]
dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
基准
通过在构建输入数据时更改 sample_size
在稍大的数据集上进行测试
sample_size <- 1000000
dim(DT)
#[1] 5000000 3
system.time(DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)])
# user system elapsed
# 0.287 0.039 0.326
system.time({ DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE, on = .(DATE)] })
# user system elapsed
# 0.294 0.006 0.300
system.time({
setkey(DT, DATE)
DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE]
setkey(DT, FRUIT)
})
# user system elapsed
# 0.431 0.045 0.476
system.time(DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE])
# user system elapsed
# 6.660 0.039 6.665
system.time({
dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
nm1 <- names(dt_wide)[-1]
dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
})
# user system elapsed
# 0.868 0.060 0.926
所以,我有这个庞大的数据集(可能有数千个条目),我的代码以 data.table 的长格式接收这些数据集,如下所示:
#sample DT
sample_size = 8
DT0 <- data.table(
DATE = seq.Date(from = as.Date("2020/3/01"), by = "day", length.out = sample_size),
BANANA = seq(30, by=0.060, length.out = sample_size),
ORANGE = seq(5, by=0.035, length.out = sample_size),
APPLE = seq(12, by=0.6, length.out = sample_size),
LEMON = seq(10, by=0.01, length.out = sample_size),
GRAPE = seq(0.5, by=0.13, length.out = sample_size)
)
DT <- melt(DT0,
id.vars = c("DATE"),
variable.name = "FRUIT",
value.name = "PRICE")
setkeyv(DT, cols=c("FRUIT", "DATE"))
DT 正是我的数据现在的样子:
> DT
DATE FRUIT PRICE
1: 2020-03-01 BANANA 30.000
2: 2020-03-02 BANANA 30.060
3: 2020-03-03 BANANA 30.120
4: 2020-03-04 BANANA 30.180
5: 2020-03-05 BANANA 30.240
6: 2020-03-06 BANANA 30.300
7: 2020-03-07 BANANA 30.360
8: 2020-03-08 BANANA 30.420
9: 2020-03-01 ORANGE 5.000
10: 2020-03-02 ORANGE 5.035
11: 2020-03-03 ORANGE 5.070
12: 2020-03-04 ORANGE 5.105
13: 2020-03-05 ORANGE 5.140
14: 2020-03-06 ORANGE 5.175
15: 2020-03-07 ORANGE 5.210
16: 2020-03-08 ORANGE 5.245
17: 2020-03-01 APPLE 12.000
18: 2020-03-02 APPLE 12.600
19: 2020-03-03 APPLE 13.200
20: 2020-03-04 APPLE 13.800
21: 2020-03-05 APPLE 14.400
22: 2020-03-06 APPLE 15.000
23: 2020-03-07 APPLE 15.600
24: 2020-03-08 APPLE 16.200
25: 2020-03-01 LEMON 10.000
26: 2020-03-02 LEMON 10.010
27: 2020-03-03 LEMON 10.020
28: 2020-03-04 LEMON 10.030
29: 2020-03-05 LEMON 10.040
30: 2020-03-06 LEMON 10.050
31: 2020-03-07 LEMON 10.060
32: 2020-03-08 LEMON 10.070
33: 2020-03-01 GRAPE 0.500
34: 2020-03-02 GRAPE 0.630
35: 2020-03-03 GRAPE 0.760
36: 2020-03-04 GRAPE 0.890
37: 2020-03-05 GRAPE 1.020
38: 2020-03-06 GRAPE 1.150
39: 2020-03-07 GRAPE 1.280
40: 2020-03-08 GRAPE 1.410
DATE FRUIT PRICE
现在,假设我需要在一个新列(“RESULT”)中计算 FRUIT 中每个项目的价格与一个常数的价格之间的差值(或任何其他更复杂的运算)每天指定水果(例如:葡萄)。
只是为了帮助您将想法形象化,RESULT 列将类似于这些操作的结果,对于每一天:
result := PRICE BANANA - PRICE GRAPE
result := PRICE ORANGE - PRICE GRAPE
result := PRICE APPLE - PRICE GRAPE
result := PRICE BANANA - PRICE GRAPE
result := PRICE LEMON - PRICE GRAPE
result := PRICE GRAPE - PRICE GRAPE
经过数小时的反复试验(但仍然不太了解我在做什么),这是我设法做到的:
#my try:
chosen_fruit <- "GRAPE"
setkey(DT, DATE)
DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE]
> DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-01 ORANGE 5.000 4.500
3: 2020-03-01 APPLE 12.000 11.500
4: 2020-03-01 LEMON 10.000 9.500
5: 2020-03-01 GRAPE 0.500 0.000
6: 2020-03-02 BANANA 30.060 29.430
7: 2020-03-02 ORANGE 5.035 4.405
8: 2020-03-02 APPLE 12.600 11.970
9: 2020-03-02 LEMON 10.010 9.380
10: 2020-03-02 GRAPE 0.630 0.000
11: 2020-03-03 BANANA 30.120 29.360
12: 2020-03-03 ORANGE 5.070 4.310
13: 2020-03-03 APPLE 13.200 12.440
14: 2020-03-03 LEMON 10.020 9.260
15: 2020-03-03 GRAPE 0.760 0.000
16: 2020-03-04 BANANA 30.180 29.290
17: 2020-03-04 ORANGE 5.105 4.215
18: 2020-03-04 APPLE 13.800 12.910
19: 2020-03-04 LEMON 10.030 9.140
20: 2020-03-04 GRAPE 0.890 0.000
21: 2020-03-05 BANANA 30.240 29.220
22: 2020-03-05 ORANGE 5.140 4.120
23: 2020-03-05 APPLE 14.400 13.380
24: 2020-03-05 LEMON 10.040 9.020
25: 2020-03-05 GRAPE 1.020 0.000
26: 2020-03-06 BANANA 30.300 29.150
27: 2020-03-06 ORANGE 5.175 4.025
28: 2020-03-06 APPLE 15.000 13.850
29: 2020-03-06 LEMON 10.050 8.900
30: 2020-03-06 GRAPE 1.150 0.000
31: 2020-03-07 BANANA 30.360 29.080
32: 2020-03-07 ORANGE 5.210 3.930
33: 2020-03-07 APPLE 15.600 14.320
34: 2020-03-07 LEMON 10.060 8.780
35: 2020-03-07 GRAPE 1.280 0.000
36: 2020-03-08 BANANA 30.420 29.010
37: 2020-03-08 ORANGE 5.245 3.835
38: 2020-03-08 APPLE 16.200 14.790
39: 2020-03-08 LEMON 10.070 8.660
40: 2020-03-08 GRAPE 1.410 0.000
DATE FRUIT PRICE RESULTS
我最终想要的输出:
setkey(DT, FRUIT)
> DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-02 BANANA 30.060 29.430
3: 2020-03-03 BANANA 30.120 29.360
4: 2020-03-04 BANANA 30.180 29.290
5: 2020-03-05 BANANA 30.240 29.220
6: 2020-03-06 BANANA 30.300 29.150
7: 2020-03-07 BANANA 30.360 29.080
8: 2020-03-08 BANANA 30.420 29.010
9: 2020-03-01 ORANGE 5.000 4.500
10: 2020-03-02 ORANGE 5.035 4.405
11: 2020-03-03 ORANGE 5.070 4.310
12: 2020-03-04 ORANGE 5.105 4.215
13: 2020-03-05 ORANGE 5.140 4.120
14: 2020-03-06 ORANGE 5.175 4.025
15: 2020-03-07 ORANGE 5.210 3.930
16: 2020-03-08 ORANGE 5.245 3.835
17: 2020-03-01 APPLE 12.000 11.500
18: 2020-03-02 APPLE 12.600 11.970
19: 2020-03-03 APPLE 13.200 12.440
20: 2020-03-04 APPLE 13.800 12.910
21: 2020-03-05 APPLE 14.400 13.380
22: 2020-03-06 APPLE 15.000 13.850
23: 2020-03-07 APPLE 15.600 14.320
24: 2020-03-08 APPLE 16.200 14.790
25: 2020-03-01 LEMON 10.000 9.500
26: 2020-03-02 LEMON 10.010 9.380
27: 2020-03-03 LEMON 10.020 9.260
28: 2020-03-04 LEMON 10.030 9.140
29: 2020-03-05 LEMON 10.040 9.020
30: 2020-03-06 LEMON 10.050 8.900
31: 2020-03-07 LEMON 10.060 8.780
32: 2020-03-08 LEMON 10.070 8.660
33: 2020-03-01 GRAPE 0.500 0.000
34: 2020-03-02 GRAPE 0.630 0.000
35: 2020-03-03 GRAPE 0.760 0.000
36: 2020-03-04 GRAPE 0.890 0.000
37: 2020-03-05 GRAPE 1.020 0.000
38: 2020-03-06 GRAPE 1.150 0.000
39: 2020-03-07 GRAPE 1.280 0.000
40: 2020-03-08 GRAPE 1.410 0.000
DATE FRUIT PRICE RESULTS
如您所见,我以某种方式达到了目标。不过好像不太理想。
我现在正在学习 R(和一般编程),所以我对所有这些都非常陌生,尤其是在尽可能优化代码方面(考虑此处的内存限制)。
关于如何改进 speed/performance 操作或更好的方法有什么建议吗? 非常感谢您!
在 OP 的代码中,我们不需要 ==
一旦我们 set
键,即第一个 setkey
就足够了,并通过 [ 加入 on
=37=] 同时做 PRICE
和 i.PRICE
setkeyv(DT, cols=c("FRUIT", "DATE"))
DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)]
或者另一种选择是按 'DATE' 分组,从相应的 PRICE 中减去 'PRICE',其中 'FRUIT' 是 'GRAPE'
library(data.table)
DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE]
-输出
DT
DATE FRUIT PRICE RESULTS
1: 2020-03-01 BANANA 30.000 29.500
2: 2020-03-02 BANANA 30.060 29.430
3: 2020-03-03 BANANA 30.120 29.360
4: 2020-03-04 BANANA 30.180 29.290
5: 2020-03-05 BANANA 30.240 29.220
6: 2020-03-06 BANANA 30.300 29.150
7: 2020-03-07 BANANA 30.360 29.080
8: 2020-03-08 BANANA 30.420 29.010
9: 2020-03-01 ORANGE 5.000 4.500
10: 2020-03-02 ORANGE 5.035 4.405
11: 2020-03-03 ORANGE 5.070 4.310
12: 2020-03-04 ORANGE 5.105 4.215
13: 2020-03-05 ORANGE 5.140 4.120
14: 2020-03-06 ORANGE 5.175 4.025
15: 2020-03-07 ORANGE 5.210 3.930
16: 2020-03-08 ORANGE 5.245 3.835
17: 2020-03-01 APPLE 12.000 11.500
18: 2020-03-02 APPLE 12.600 11.970
19: 2020-03-03 APPLE 13.200 12.440
20: 2020-03-04 APPLE 13.800 12.910
21: 2020-03-05 APPLE 14.400 13.380
22: 2020-03-06 APPLE 15.000 13.850
23: 2020-03-07 APPLE 15.600 14.320
24: 2020-03-08 APPLE 16.200 14.790
25: 2020-03-01 LEMON 10.000 9.500
26: 2020-03-02 LEMON 10.010 9.380
27: 2020-03-03 LEMON 10.020 9.260
28: 2020-03-04 LEMON 10.030 9.140
29: 2020-03-05 LEMON 10.040 9.020
30: 2020-03-06 LEMON 10.050 8.900
31: 2020-03-07 LEMON 10.060 8.780
32: 2020-03-08 LEMON 10.070 8.660
33: 2020-03-01 GRAPE 0.500 0.000
34: 2020-03-02 GRAPE 0.630 0.000
35: 2020-03-03 GRAPE 0.760 0.000
36: 2020-03-04 GRAPE 0.890 0.000
37: 2020-03-05 GRAPE 1.020 0.000
38: 2020-03-06 GRAPE 1.150 0.000
39: 2020-03-07 GRAPE 1.280 0.000
40: 2020-03-08 GRAPE 1.410 0.000
或者另一个选择是 dcast
到 'wide' 格式然后做减法
dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
nm1 <- names(dt_wide)[-1]
dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
基准
通过在构建输入数据时更改 sample_size
在稍大的数据集上进行测试
sample_size <- 1000000
dim(DT)
#[1] 5000000 3
system.time(DT[DT[.(chosen_fruit)], RESULTS := PRICE - i.PRICE, on = .(DATE)])
# user system elapsed
# 0.287 0.039 0.326
system.time({ DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE, on = .(DATE)] })
# user system elapsed
# 0.294 0.006 0.300
system.time({
setkey(DT, DATE)
DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE]
setkey(DT, FRUIT)
})
# user system elapsed
# 0.431 0.045 0.476
system.time(DT[, RESULTS := PRICE - PRICE[FRUIT == 'GRAPE'], DATE])
# user system elapsed
# 6.660 0.039 6.665
system.time({
dt_wide <- dcast(DT, DATE ~ FRUIT, value.var = 'PRICE')
nm1 <- names(dt_wide)[-1]
dt_wide[, (nm1) := lapply(.SD, function(x) x - GRAPE), .SDcols = nm1]
})
# user system elapsed
# 0.868 0.060 0.926