R/data.table: optimize "recursive" groupby
I am working with large data.tables (1e6-10e6 rows, 10 columns) containing genomic data. I want to reduce the data by collapsing each group to a single row. The reduction depends on several columns, but in consecutive steps. Example data:
dt.tmp <- data.table(str1 = paste0("A", sample(1:100, 2000, replace = TRUE)),
                     str2 = paste0("B", sample(1:5, 2000, replace = TRUE)),
                     c1 = sample(1:3, 2000, replace = TRUE),
                     c2 = sample(1:3, 2000, replace = TRUE),
                     d1 = sample(1:2, 2000, replace = TRUE),
                     d2 = sample(1:2, 2000, replace = TRUE))
For this data I want to reduce to one row per str1, using the following steps:
- within the group defined by str1, form subgroups based on str2 and select the largest subgroup(s)
- in the resulting group(s), select the row(s) with the largest (c1+c2)
- in the resulting group(s), select the row(s) with the largest (d1+d2)
- from the resulting group(s), select one random row
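For reference, these four sequential filters amount to a single lexicographic argmax per str1 group: each later criterion only breaks ties left by the earlier ones. So the whole reduction can be sketched as one sort followed by taking the first row per group. This is a sketch of the idea on the toy data, not the original code; the helper columns n, c, d and rnd are introduced here:

```r
library(data.table)
set.seed(42)
dt.tmp <- data.table(str1 = paste0("A", sample(1:100, 2000, replace = TRUE)),
                     str2 = paste0("B", sample(1:5, 2000, replace = TRUE)),
                     c1 = sample(1:3, 2000, replace = TRUE),
                     c2 = sample(1:3, 2000, replace = TRUE),
                     d1 = sample(1:2, 2000, replace = TRUE),
                     d2 = sample(1:2, 2000, replace = TRUE))

# Helper columns: subgroup size, the two sums, and a random tie-breaker
dt.tmp[, rnd := sample.int(.N)]
dt.tmp[, n := .N, by = .(str1, str2)]
dt.tmp[, `:=`(c = c1 + c2, d = d1 + d2)]

# One lexicographic sort; the first row per str1 is then exactly the row
# the four sequential filters would keep (ties resolved at random via rnd)
setorder(dt.tmp, str1, -n, -c, -d, rnd)
res <- unique(dt.tmp, by = "str1")  # unique() keeps the first row per str1
```

The helper columns can be dropped from `res` afterwards. A single sort plus `unique()` avoids any per-group `.SD` work entirely.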
I have tried various combinations operating on .SD, e.g.:
dt.tmp[,':='(c=c1+c2, d=d1+d2,rnd=sample.int(.N))
][,':='(n=.N),by=.(str1,str2)
][,.SD[n==max(n),
.SD[c==max(c),
.SD[d==max(d),
.SD[rnd==max(rnd)], by=d],
by=c],
by=n],
by=str1];
My last attempt tries to minimize the use of .SD:
dt.tmp[,':='(c=c1+c2, d=d1+d2, rnd=sample.int(.N))
][,':='(n=.N,cmaxidx=(c==max(c))),by=.(str1,str2)
][,':='(nmaxidx=(n==max(n))),by=str1
][,':='(dmaxidx=(d==max(d))),by=.(str1,str2,c)
][,.SD[dmaxidx&cmaxidx&nmaxidx
][rnd==max(rnd)], by=str1
][,':='(c=NULL,d=NULL,nmaxidx=NULL,cmaxidx=NULL,dmaxidx=NULL,n=NULL,rnd=NULL)][,.SD]
(the operations at the end are just cleanup and printing)
I am not at all "into" data.table. Are there obvious optimizations I can apply to the problem/code above to reduce execution time? Currently it takes around 200-300 CPU hours, which comes down to roughly 14 wall-clock hours using at most 24 cores on our server.
The real data looks like this:
Classes 'data.table' and 'data.frame': 50259993 obs. of 26 variables:
$ BC : chr "AAAAAAAAAAAACAAGGTCG" "AAAAAAAAAAAACTACCGTG" "AAAAAAAAAAAAGCACTGAG" "AAAAAAAAAAAAGCACTGAG" ...
$ chrom : chr "chr2L" "chr2R" "chr2R" "chr2R" ...
$ start : int 22371281 12477441 8323580 8323580 17304870 31837917 24897443 22469324 22469324 18294732 ...
$ end : int 22371463 12477734 8323924 8323924 17305040 31838183 24897665 22469723 22469723 18295044 ...
$ strand : chr "+" "+" "-" "-" ...
$ MAPQ1 : int 1 40 42 42 42 42 24 1 1 42 ...
$ MAPQ2 : int 1 40 42 42 42 42 24 1 1 42 ...
$ AS1 : int -3 -33 0 -3 -12 -6 -39 0 0 0 ...
$ AS2 : int -12 -3 -18 -15 0 0 -3 -5 -20 -6 ...
$ XS1 : num -3 NA NA NA NA NA NA 0 0 NA ...
$ XS2 : num -12 NA NA NA NA NA NA 0 -15 NA ...
$ SNP_ABS_POS: chr "22371329,22371329,22371356,22371356,22371437" "12477460,12477500,12477524,12477707,12477719" "8323582,8323583,8323588,8323750,8323759,8323791,8323868,8323878" "8323582,8323583,8323588,8323750,8323759,8323791,8323868,8323878" ...
$ SNP_REL_POS: chr "48,48,75,75,156" "19,59,83,266,278" "2,3,8,170,179,211,288,298" "2,3,8,170,179,211,288,298" ...
$ SNP_ID : chr ".,.,.,.,." ".,.,.,.,." ".,.,.,.,.,.,.,." ".,.,.,.,.,.,.,." ...
$ SNP_SEQ : chr "CCCTTCATCGCACGAATGTGTGCGT,CCCTTCATCGCACGAATGTGAGCGT,A,A,T" "T,G,ACCGGCATCCATCCATCCAT,T,C" "T,T,ACG,A,G,G,C,T" "T,T,ACG,A,G,G,C,T" ...
$ SNP_VAR : chr "-3,-3,0,0,0" "0,-1,-2,-1,0" "1,1,-3,-2,-2,-2,-1,-1" "1,1,-3,-2,-2,-2,-1,-1" ...
$ SNP_PARENT : chr "unexpected,unexpected,expected,expected,expected" "expected,non_parental_allele,unread,non_parental_allele,expected" "expected,expected,unexpected,unread,unread,unread,non_parental_allele,non_parental_allele" "expected,expected,unexpected,unread,unread,unread,non_parental_allele,non_parental_allele" ...
$ SNP_TYPE : chr "indel,indel,snp,snp,snp" "snp,snp,indel,snp,snp" "snp,indel,indel,snp,snp,snp,snp,snp" "snp,indel,indel,snp,snp,snp,snp,snp" ...
$ SNP_SUBTYPE: chr "del,del,ts,ts,tv" "tv,tv,del,tv,ts" "tv,del,ins,tv,tv,tv,ts,tv" "tv,del,ins,tv,tv,tv,ts,tv" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "BC" "chrom" "start" "end"
Here BC=str1, chrom+start+end=str2, MAPQ1/2=c1/2 and AS1/2=d1/2. This data reduces to roughly 20e6 rows.
The input data is sorted by chrom, start, end. Is there a way to take advantage of this particular ordering?
Am I right in thinking that using .SD requires extra memory (although memory is not really a problem atm) and is therefore suboptimal?
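`.SD[...]` does materialize each group's subset before filtering it. A common data.table idiom avoids those per-group copies: compute the global row numbers of the rows to keep with the special symbol `.I`, then subset the table once. A minimal sketch on toy columns (not the real data):

```r
library(data.table)
set.seed(1)
dt <- data.table(str1 = rep(c("A1", "A2", "A3"), each = 4),
                 c1 = sample(1:3, 12, replace = TRUE),
                 c2 = sample(1:3, 12, replace = TRUE))

# .SD version: builds a per-group subset before filtering it
sd_res <- dt[, .SD[c1 + c2 == max(c1 + c2)], by = str1]

# .I version: .I holds the global row numbers of each group's rows,
# so we collect the indices of the winning rows and subset once
keep  <- dt[, .I[c1 + c2 == max(c1 + c2)], by = str1]$V1
i_res <- dt[keep]
```

Both return the same rows in the same order; the `.I` form tends to be noticeably faster on many small groups because the only materialized object is one integer vector of indices.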
Any help and pointers are much appreciated.
Session info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.2
loaded via a namespace (and not attached):
[1] compiler_3.6.1 R.methodsS3_1.7.1 R.utils_2.8.0 R.oo_1.22.0
Broken down into separate steps:
# Within group defined by str1 create groups based on str2 and select the largest group(s)
combinations2keep <- dt.tmp[, .N, by = .(str1, str2)
][, .SD[N == max(N)], by = str1
][, !"N"]
dt.tmp <- dt.tmp[combinations2keep, on = .(str1, str2)]
# In resulting group(s) select group(s) with max (c1+c2)
dt.tmp <- dt.tmp[, .SD[c1+c2 == max(c1+c2)], by = str1]
# In resulting group(s) select group(s) with max (d1+d2)
dt.tmp <- dt.tmp[, .SD[d1+d2 == max(d1+d2)], by = str1]
# In resulting group(s) select a random row
dt.tmp <- dt.tmp[, .SD[sample(.N, size = 1)], by = str1]
Compressed into one chain:
dt.tmp[dt.tmp[, .N, by = .(str1, str2)][, .SD[N == max(N)], by = str1],
on = .(str1, str2)
][, .SD[c1+c2 == max(c1+c2)], by = str1
][, .SD[d1+d2 == max(d1+d2)], by = str1
][, .SD[sample(.N, size = 1)], by = str1
][, !"N"]
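One further micro-optimization worth considering (my suggestion, not benchmarked on the real data): the final `.SD[sample(.N, size = 1)]` calls `sample()` once per group. Shuffling the whole table once and then keeping the first row per group picks a uniformly random row per group in a single pass:

```r
library(data.table)
set.seed(7)
dt <- data.table(str1 = rep(c("A1", "A2"), each = 5), x = 1:10)

# Per-group sampling, one sample() call per group:
# dt[, .SD[sample(.N, size = 1)], by = str1]

# Equivalent in distribution: permute all rows once, then keep the
# first occurrence of each str1 (unique() keeps the first row seen)
res <- unique(dt[sample(.N)], by = "str1")
```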
@sindri_baldur: I made a further optimization on top of your answer. In about half of the cases the first grouping yields single-row groups. By splitting the data after the first grouping into the single-row groups and the rest, half of the data needs no further grouping steps. This saves an additional 10-20% of computation time.
dt.tmp.N <- dt.tmp[, .N, by = .(BC, chrom, start, end)
            ][, .SD[N == max(N)], by = BC]
dt.tmp.1 <- dt.tmp[dt.tmp.N[N == 1], on = .(BC, chrom, start, end)
            ][, .SD[sample(.N, 1)], by = BC][, !"N"]
dt.tmp.Ng1 <- dt.tmp[dt.tmp.N[N > 1], on = .(BC, chrom, start, end)
              ][, .SD[MAPQ1 + MAPQ2 == max(MAPQ1 + MAPQ2)], by = BC
              ][, .SD[AS1 + AS2 == max(AS1 + AS2)], by = BC
              ][, .SD[sample(.N, 1)], by = BC
              ][, !"N"]
rbindlist(list(dt.tmp.1, dt.tmp.Ng1))
(PS: I tried to post this as a comment, but it was too long.)