使用 R 中的多个组合按组计算均值和 nmiss,data.table
Compute mean and nmiss by group with multiple combinations in R, data.table
我想计算具有大型数据集的几个组组合的 NA 的平均值和计数。这可能最容易用一些测试数据来解释。我在 Macbook Pro 上使用最新版本的 R 和 data.table 包(数据很大,>1M 行)。
(注意:我在发布这篇文章后注意到我不小心对下面的 "m = " 变量使用了 sum() 而不是 mean() 。我没有编辑它,因为我不想重新 运行一切,不要认为它那么重要)
set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
ID2 = data.table(id2 = 20001:20050)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
DT <- DT[,as.list(ID2), by = .(yr, id)] # intentional cartesian join
rm("YR","ID","ID2")
# 2.7M obs, now add data
DT[,`:=` (ratio = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio %% 5) == 0, ratio:=NA] # make some of the ratios NA
DT[,`:=` (keep = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
# do it again
DT[,`:=` (ratio2 = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio2 %% 4) == 0, ratio2:=NA] # make some of the ratios NA
DT[,`:=` (keep2 = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
所以,我有的是识别信息(yr,id,id2)和我想总结的数据:keep1|2,ratio1|2。特别是通过 yr-id,我想使用 keep 和 keep2(从而压缩 id2)来计算平均比率和比率 2。我考虑过通过 keep/keep2 计算比率和比率 2 进行子集化,或者通过 keep*ratio、keep2*ratio、keep*ratio2 和 keep2*ratio2 的矩阵乘法来实现。
首先,我这样做的方式得到了正确的答案,但速度很慢:
system.time(test1 <- DT[,.SD[keep == 1,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.083 0.191 23.319
这在大约相同的时间内同样有效。我认为首先对主要数据进行子集化而不是在 .SD 中进行子集化可能会更快:
system.time(test2 <- DT[keep == 1,.SD[,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.723 0.208 23.963
这两种方法的问题是我需要对每个 keep
变量进行单独的计算。因此我尝试了这种方式:
system.time(test3 <- DT[,.SD[,.( m = sum(ratio*keep,na.rm = TRUE),
nmiss = sum(is.na(ratio*keep)) )
],by=.(yr,id)])
user system elapsed
25.997 0.191 26.217
这允许我将所有公式放在一起(我可以添加 ratio*keep2
、ratio2*keep
和 ratio2*keep2
)但是 1. 它比较慢 2. 它没有得到NA 的正确数量(参见 nmiss
列):
> summary(test1)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test2)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test3)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. : 0.00
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.: 2.00
Median :1988 Median :10500 Median : 53.828 Median : 4.00
Mean :1988 Mean :10500 Mean : 59.653 Mean : 4.99
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.: 8.00
Max. :2015 Max. :11000 Max. :211.552 Max. :20.00
通过 yr-id 获取我的 4 种汇总信息组合的最快方法是什么?
现在,我重复使用选项 1 或 2 两次(一次用于 keep,另一次用于 keep2)
可以直接在j
中的表达式中进行汇总:
# solution A: summarize in `.SD`:
system.time({
test2 <- DT[keep == 1,
.SD[, .(m = sum(ratio, na.rm = TRUE),
nmiss = sum(is.na(ratio)))],
by = .(yr, id), verbose = T]
})
# user system elapsed
# 22.359 0.439 22.561
# solution B: summarize directly in j:
system.time({
test2 <- DT[keep == 1, .(m = sum(ratio, na.rm = T),
nmiss = sum(is.na(ratio))),
by = .(yr, id), verbose = T]
})
# user system elapsed
# 0.118 0.077 0.195
添加verbose = T
以显示两种方法之间的区别:
对于解决方案 A:
lapply optimization is on, j unchanged as '.SD[, list(m = sum(ratio,
na.rm = TRUE), nmiss = sum(is.na(ratio)))]' GForce is on, left j
unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
The result of j is
a named list. It's very inefficient to create the same names over and
over again for each group.
当j=list(...)时,检测到任何名称,
为了提高效率,在分组完成后移除并放回原处。
例如,使用 j=transform() 会阻止加速(考虑
更改为 :=)。此消息将来可能会升级为警告。
收集不连续的组需要 0.058 秒,共 54000 个组
eval(j) 进行了 54000 次调用,耗时 22.487 秒
22.521 秒
对于解决方案 B:
...
Finding group sizes from the positions (can be avoided to save RAM)
... 0 sec lapply optimization is on, j unchanged as 'list(sum(ratio,
na.rm = T), sum(is.na(ratio)))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged. Making each group and
running j (GForce FALSE) ... collecting discontiguous groups took
0.027s for 54000 groups eval(j) took 0.079s for 54000 calls
0.168 secs
主要区别在于 B 中的汇总被视为命名列表,当有很多组时(此数据为 54k 组!),这非常慢。有关此类型的类似基准,请参阅 this one.
第二部分(你的test3):
我们没有先按 keep = 1
过滤列。所以那些 NA
中 keep !=
也算在 nmiss
中。因此,NA
的计数是不同的。
我想计算具有大型数据集的几个组组合的 NA 的平均值和计数。这可能最容易用一些测试数据来解释。我在 Macbook Pro 上使用最新版本的 R 和 data.table 包(数据很大,>1M 行)。 (注意:我在发布这篇文章后注意到我不小心对下面的 "m = " 变量使用了 sum() 而不是 mean() 。我没有编辑它,因为我不想重新 运行一切,不要认为它那么重要)
set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
ID2 = data.table(id2 = 20001:20050)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
DT <- DT[,as.list(ID2), by = .(yr, id)] # intentional cartesian join
rm("YR","ID","ID2")
# 2.7M obs, now add data
DT[,`:=` (ratio = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio %% 5) == 0, ratio:=NA] # make some of the ratios NA
DT[,`:=` (keep = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
# do it again
DT[,`:=` (ratio2 = rep(sample(10),each=27000)+rnorm(nrow(DT)))]
DT <- DT[round(ratio2 %% 4) == 0, ratio2:=NA] # make some of the ratios NA
DT[,`:=` (keep2 = as.integer(rnorm(nrow(DT)) > 0.7)) ] # add in the indicator variable
所以,我有的是识别信息(yr,id,id2)和我想总结的数据:keep1|2,ratio1|2。特别是通过 yr-id,我想使用 keep 和 keep2(从而压缩 id2)来计算平均比率和比率 2。我考虑过通过 keep/keep2 计算比率和比率 2 进行子集化,或者通过 keep*ratio、keep2*ratio、keep*ratio2 和 keep2*ratio2 的矩阵乘法来实现。
首先,我这样做的方式得到了正确的答案,但速度很慢:
system.time(test1 <- DT[,.SD[keep == 1,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.083 0.191 23.319
这在大约相同的时间内同样有效。我认为首先对主要数据进行子集化而不是在 .SD 中进行子集化可能会更快:
system.time(test2 <- DT[keep == 1,.SD[,.(m = sum(ratio,na.rm = TRUE),
nmiss = sum(is.na(ratio)) )
],by=.(yr,id)])
user system elapsed
23.723 0.208 23.963
这两种方法的问题是我需要对每个 keep
变量进行单独的计算。因此我尝试了这种方式:
system.time(test3 <- DT[,.SD[,.( m = sum(ratio*keep,na.rm = TRUE),
nmiss = sum(is.na(ratio*keep)) )
],by=.(yr,id)])
user system elapsed
25.997 0.191 26.217
这允许我将所有公式放在一起(我可以添加 ratio*keep2
、ratio2*keep
和 ratio2*keep2
)但是 1. 它比较慢 2. 它没有得到NA 的正确数量(参见 nmiss
列):
> summary(test1)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test2)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. :0.000
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.:0.000
Median :1988 Median :10500 Median : 53.828 Median :1.000
Mean :1988 Mean :10500 Mean : 59.653 Mean :1.207
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.:2.000
Max. :2015 Max. :11000 Max. :211.552 Max. :9.000
> summary(test3)
yr id m nmiss
Min. :1962 Min. :10001 Min. : -2.154 Min. : 0.00
1st Qu.:1975 1st Qu.:10251 1st Qu.: 30.925 1st Qu.: 2.00
Median :1988 Median :10500 Median : 53.828 Median : 4.00
Mean :1988 Mean :10500 Mean : 59.653 Mean : 4.99
3rd Qu.:2002 3rd Qu.:10750 3rd Qu.: 85.550 3rd Qu.: 8.00
Max. :2015 Max. :11000 Max. :211.552 Max. :20.00
通过 yr-id 获取我的 4 种汇总信息组合的最快方法是什么? 现在,我重复使用选项 1 或 2 两次(一次用于 keep,另一次用于 keep2)
可以直接在j
中的表达式中进行汇总:
# solution A: summarize in `.SD`:
system.time({
test2 <- DT[keep == 1,
.SD[, .(m = sum(ratio, na.rm = TRUE),
nmiss = sum(is.na(ratio)))],
by = .(yr, id), verbose = T]
})
# user system elapsed
# 22.359 0.439 22.561
# solution B: summarize directly in j:
system.time({
test2 <- DT[keep == 1, .(m = sum(ratio, na.rm = T),
nmiss = sum(is.na(ratio))),
by = .(yr, id), verbose = T]
})
# user system elapsed
# 0.118 0.077 0.195
添加verbose = T
以显示两种方法之间的区别:
对于解决方案 A:
lapply optimization is on, j unchanged as '.SD[, list(m = sum(ratio, na.rm = TRUE), nmiss = sum(is.na(ratio)))]' GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is
a named list. It's very inefficient to create the same names over and over again for each group.
当j=list(...)时,检测到任何名称, 为了提高效率,在分组完成后移除并放回原处。 例如,使用 j=transform() 会阻止加速(考虑 更改为 :=)。此消息将来可能会升级为警告。
收集不连续的组需要 0.058 秒,共 54000 个组
eval(j) 进行了 54000 次调用,耗时 22.487 秒 22.521 秒
对于解决方案 B:
...
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec lapply optimization is on, j unchanged as 'list(sum(ratio, na.rm = T), sum(is.na(ratio)))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged. Making each group and running j (GForce FALSE) ... collecting discontiguous groups took 0.027s for 54000 groups eval(j) took 0.079s for 54000 calls 0.168 secs
主要区别在于 B 中的汇总被视为命名列表,当有很多组时(此数据为 54k 组!),这非常慢。有关此类型的类似基准,请参阅 this one.
第二部分(你的test3):
我们没有先按 keep = 1
过滤列。所以那些 NA
中 keep !=
也算在 nmiss
中。因此,NA
的计数是不同的。