foverlaps data.table || error y's key must be identical to the columns specified in by.y
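I have two data frames. One has two columns and the other has three. The first data frame holds SNP names and their positions. The second, three-column data frame holds gene names together with each gene's start and end coordinates.
I want to perform a join based on those boundaries: if a SNP falls within a gene's boundaries, return it.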
dt_snp<-data.table("SNP"=c(paste("SNP",seq(1:10),sep="")),
"BP"=c(1100, 89200, 2500, 33000, 5500, 69500, 12000,8800, 23200, 27000 )) ## SNP data
dt_gene<-data.table("GENE"=c("GENE1","GENE2","GENE3","GENE4","GENE5"),
"START"=c(1000,2100,5000,40000,23000), "END"=c(2000,3000,9000,45000,30000)) ## Gene data
## do a join using data.table
snp_withingenes<-dt_snp[dt_gene, c("SNP","BP","GENE","START","END"), on=.(BP>=START, BP<=END), nomatch=0] # inner join
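This gives me the desired result, but when I run the same task from an R script that lives inside an R package, I get a warning about the . operator. The warning is as follows: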
function_small: no visible global function definition for ‘.’
Undefined global functions or variables:
.
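So I would like to use foverlaps instead, but I am having a hard time understanding it and getting the expected result with it. It feels counterintuitive to me.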
foverlaps(dt_snp,dt_gene, by.x=c("SNP","BP"), by.y=c("GENE","START","END"), nomatch=NA, type="any")
Error in foverlaps(dt_snp, dt_gene, by.x = c("SNP", "BP"), by.y = c("GENE", :
The first 3 columns of y's key must be identical to the columns specified in by.y.
How should I obtain the desired output?
data.table_1.13.0, R v4.0, Windows platform
The check from devtools is troubled by the . operator on R v4.0, rmarkdown_2.3, devtools_2.3.1, on a UNIX platform.
To expand on my comment, here is the foverlaps option. It needs two interval columns in both data.tables, so it does not seem optimal here:
library(data.table)
dt_snp <- data.table("SNP"=c(paste("SNP",seq(1:10),sep="")),
"BP"=c(1100, 89200, 2500, 33000, 5500, 69500, 12000,8800, 23200, 27000 )) ## SNP data
dt_gene <- data.table("GENE"=c("GENE1","GENE2","GENE3","GENE4","GENE5"),
"START"=c(1000,2100,5000,40000,23000), "END"=c(2000,3000,9000,45000,30000)) ## Gene data
setkey(dt_gene, START, END)
dt_snp[, BP2 := BP]
## do a join using data.table
dt_snp[dt_gene, c("SNP","BP","GENE","START","END"), on=list(BP2 >= START, BP2 <= END), nomatch=0][]
#>      SNP    BP  GENE START   END
#> 1:  SNP1  1100 GENE1  1000  2000
#> 2:  SNP3  2500 GENE2  2100  3000
#> 3:  SNP5  5500 GENE3  5000  9000
#> 4:  SNP8  8800 GENE3  5000  9000
#> 5:  SNP9 23200 GENE5 23000 30000
#> 6: SNP10 27000 GENE5 23000 30000
setkey(dt_snp, BP, BP2)
foverlaps(dt_snp,dt_gene, by.x=c("BP", "BP2"), by.y=c("START","END"), nomatch=NULL, type="any")[, BP2 := NULL][]
#>     GENE START   END   SNP    BP
#> 1: GENE1  1000  2000  SNP1  1100
#> 2: GENE2  2100  3000  SNP3  2500
#> 3: GENE3  5000  9000  SNP5  5500
#> 4: GENE3  5000  9000  SNP8  8800
#> 5: GENE5 23000 30000  SNP9 23200
#> 6: GENE5 23000 30000 SNP10 27000
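As for why the call in the question failed: foverlaps treats the last two columns of `by.x` and `by.y` as the interval start and end, and `y` has to be keyed so that `by.y` matches the leading columns of its key. In the question, `dt_gene` had no key at all, and `by.y` also listed `GENE`, which is a label rather than an interval bound. A minimal sketch of the difference, rebuilding the tables from scratch (the object names `err` and `res` are only illustrative, and `type = "within"` is my choice here, not something the answer above uses):

```r
library(data.table)

dt_snp <- data.table(SNP = paste0("SNP", 1:10),
                     BP  = c(1100, 89200, 2500, 33000, 5500,
                             69500, 12000, 8800, 23200, 27000))
dt_gene <- data.table(GENE  = paste0("GENE", 1:5),
                      START = c(1000, 2100, 5000, 40000, 23000),
                      END   = c(2000, 3000, 9000, 45000, 30000))

# The question's call: y is unkeyed and by.y includes the label column,
# so foverlaps() stops with an error; capture its message instead of failing
err <- tryCatch(
  foverlaps(dt_snp, dt_gene, by.x = c("SNP", "BP"),
            by.y = c("GENE", "START", "END")),
  error = function(e) conditionMessage(e)
)

# Key y on its interval columns only
setkey(dt_gene, START, END)

# A SNP is a single point, so give x a zero-width interval [BP, BP2]
dt_snp[, BP2 := BP]

# Keep only SNPs whose point interval lies within a gene interval
res <- foverlaps(dt_snp, dt_gene, by.x = c("BP", "BP2"),
                 by.y = c("START", "END"), type = "within", nomatch = NULL)
res[, BP2 := NULL]
```

For point positions, `type = "within"` and the answer's `type = "any"` should select the same rows, since a zero-width interval overlaps a gene exactly when it lies inside it.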
Created on 2020-08-06 by the reprex package (v0.3.0)
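On the check note itself: it comes from static code analysis, which sees `.()` as a call to a function named `.` that it cannot find. The non-equi join in the reprex above already sidesteps this by writing `list()` instead of `.()`. Another option, sketched below, is the character form of `on`, which data.table documents for non-equi joins. The helper name `snps_within_genes` is hypothetical. The `BP2` copy follows the answer's trick, on the understanding that the join columns of `x` in a non-equi join result carry the values from `i`, so the original `BP` is kept in a separate column.

```r
library(data.table)

# Hypothetical helper; column names follow the tables in the question
snps_within_genes <- function(snp, gene) {
  snp <- copy(snp)                          # do not modify the caller's table
  set(snp, j = "BP2", value = snp[["BP"]])  # duplicate BP without using :=
  # Character form of `on`: no call to .() for the check to flag
  res <- snp[gene, on = c("BP2>=START", "BP2<=END"), nomatch = NULL]
  # Select by name so that no bare column symbols appear in the function
  res[, c("SNP", "BP", "GENE"), with = FALSE]
}

dt_snp <- data.table(SNP = paste0("SNP", 1:10),
                     BP  = c(1100, 89200, 2500, 33000, 5500,
                             69500, 12000, 8800, 23200, 27000))
dt_gene <- data.table(GENE  = paste0("GENE", 1:5),
                      START = c(1000, 2100, 5000, 40000, 23000),
                      END   = c(2000, 3000, 9000, 45000, 30000))

out <- snps_within_genes(dt_snp, dt_gene)
```

Using `set()` and character column names keeps bare symbols such as `BP` out of the function body, which would otherwise tend to produce similar "no visible binding" notes. Declaring the names with `utils::globalVariables()` in the package is another route that is sometimes used.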