如何从两个数据表中删除重叠序列?
How to remove overlapping sequences from two datatables?
我有两个 data.tables 提供跨不同染色体(类别)的序列坐标。例如:
library(data.table)
dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 110, 150, 110),
end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
start = c(12, 60, 50),
end = c(20, 115, 80))
我需要创建第三个 data.table,它为包含 dt1 中所有整数的序列提供坐标,这些整数不与 dt2 中的序列中的任何整数重叠。例如:
dt3 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 116, 150, 110),
end = c(11, 59, 121, 200, 200))
我需要 运行 这个 data.tables 非常大,因此我需要最大化性能。我尝试使用 foverlaps() 函数但无济于事。任何帮助将不胜感激!
您可以从 foverlaps
中的类似内容开始
setkey(dt2,chromosome,start,end)
ds = foverlaps(dt1,dt2, type="any")
ds[,.(chromosome,
start = fcase(is.na(start) | i.start <= start,i.start,
i.end >= end, end + 1),
end = fcase(is.na(end) | i.end >= end, i.end,
i.start <= start, start - 1)
)]
# chromosome start end
# <char> <num> <num>
#1: 1 1 11
#2: 1 50 59
#3: 1 116 121
#4: 1 150 200
#5: X 110 200
为了完整起见,使用 Bioconductor 的 GenomicRanges
包有一个简洁的解决方案:
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 1-11 *
[2] 1 50-59 *
[3] 1 116-121 *
[4] 1 150-200 *
[5] X 110-200 *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
如果结果需要 class data.table
:
library(data.table) # development version 1.14.3 used
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2)) |>
as.data.table() |>
DT(, .(chromosome = seqnames, start, end))
chromosome start end
<fctr> <int> <int>
1: 1 1 11
2: 1 50 59
3: 1 116 121
4: 1 150 200
5: X 110 200
作为, the GenomicRanges
package is not available from CRAN. Waldi has provided the link to the installation guide in the BiocManager
vignette。这是简短的版本:
install.packages("BiocManager")
BiocManager::install("GenomicRanges")
我有两个 data.tables 提供跨不同染色体(类别)的序列坐标。例如:
library(data.table)
dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 110, 150, 110),
end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
start = c(12, 60, 50),
end = c(20, 115, 80))
我需要创建第三个 data.table,它为包含 dt1 中所有整数的序列提供坐标,这些整数不与 dt2 中的序列中的任何整数重叠。例如:
dt3 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 116, 150, 110),
end = c(11, 59, 121, 200, 200))
我需要 运行 这个 data.tables 非常大,因此我需要最大化性能。我尝试使用 foverlaps() 函数但无济于事。任何帮助将不胜感激!
您可以从 foverlaps
setkey(dt2,chromosome,start,end)
ds = foverlaps(dt1,dt2, type="any")
ds[,.(chromosome,
start = fcase(is.na(start) | i.start <= start,i.start,
i.end >= end, end + 1),
end = fcase(is.na(end) | i.end >= end, i.end,
i.start <= start, start - 1)
)]
# chromosome start end
# <char> <num> <num>
#1: 1 1 11
#2: 1 50 59
#3: 1 116 121
#4: 1 150 200
#5: X 110 200
为了完整起见,使用 Bioconductor 的 GenomicRanges
包有一个简洁的解决方案:
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2))
GRanges object with 5 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] 1 1-11 * [2] 1 50-59 * [3] 1 116-121 * [4] 1 150-200 * [5] X 110-200 * ------- seqinfo: 2 sequences from an unspecified genome; no seqlengths
如果结果需要 class data.table
:
library(data.table) # development version 1.14.3 used
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2)) |>
as.data.table() |>
DT(, .(chromosome = seqnames, start, end))
chromosome start end <fctr> <int> <int> 1: 1 1 11 2: 1 50 59 3: 1 116 121 4: 1 150 200 5: X 110 200
作为GenomicRanges
package is not available from CRAN. Waldi has provided the link to the installation guide in the BiocManager
vignette。这是简短的版本:
install.packages("BiocManager")
BiocManager::install("GenomicRanges")