Find overlapping ranges in a dataframe and assign them values
This is a simpler version of my original question, which I asked earlier but which has not been answered yet.
I have a huge input file (a representative sample is shown below as input):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it according to the rule described below, so that I get an output like this:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 1
chr2:300-700 1 1 0
chr1:250-450 1 1 1
chr2:200-500 1 1 0
chr2:600-1000 1 1 1
chr1:400-800 0 1 1
chr1:700-870 1 0 1
chr2:700-1400 0 1 1
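Rule:
Take each entry of the data frame (the first one here is chr1:200-400) and check whether it overlaps any other value in the data frame. If it does, write 1 under each column in which an overlapping value occurs; otherwise write 0 under that column.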
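For example, take the first entry of the input, input[1,1], which is chr1:200-400. Since it is present in column 1, we write 1 under CT1. Next we check whether this range overlaps any range in any other column of input. It overlaps only the first value (chr1:250-450) of the second column (CT2), so we write 1 under CT2 as well. Since it does not overlap any value in CT3, we write 0 under CT3 in the output data frame.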
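The overlap test described above can be sketched in base R as follows. The helper names parse_range() and overlaps() are hypothetical and not part of the question. The strict inequalities are an assumption inferred from the expected output, where chr1:200-400 and chr1:400-800 (which only touch at 400) are not counted as overlapping.

```r
# Hypothetical helper: split strings such as "chr1:200-400" into
# chromosome, lower bound and upper bound.
parse_range <- function(x) {
  parts <- strsplit(as.character(x), "[:-]")
  data.frame(
    chr  = vapply(parts, `[`, character(1), 1),
    low  = as.integer(vapply(parts, `[`, character(1), 2)),
    high = as.integer(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: two ranges overlap when they lie on the same
# chromosome and each one starts before the other ends.
# Assumption: ranges that merely touch at an endpoint do not overlap.
overlaps <- function(a, b) {
  ra <- parse_range(a)
  rb <- parse_range(b)
  ra$chr == rb$chr & ra$low < rb$high & ra$high > rb$low
}

overlaps("chr1:200-400", "chr1:250-450")  # TRUE
overlaps("chr1:200-400", "chr1:400-800")  # FALSE, the ranges only touch at 400
overlaps("chr1:200-400", "chr2:200-500")  # FALSE, different chromosomes
```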
Here is the dput output of input and output:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
A possible solution using the data.table package:
# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)
# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
, by = variable][]
# create aggregation function; needed in the last reshape step
f <- function(x) as.integer(length(x) > 0)
# cartesian self join & reshape result back to wide format with aggregation function
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
][, dcast(.SD, value ~ i.variable, fun.aggregate = f)]
which gives:
value CT1 CT2 CT3
1: chr1:200-400 1 1 0
2: chr1:250-450 1 1 1
3: chr1:400-800 0 1 1
4: chr1:700-870 1 0 1
5: chr1:800-970 1 0 1
6: chr2:200-500 1 1 0
7: chr2:300-700 1 1 0
8: chr2:600-1000 1 1 1
9: chr2:700-1400 0 1 1
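For cross-checking the result on a small input, the same table can be built in base R with an all-pairs comparison. This is only a sketch, not the method used in the answer: it is quadratic in the number of ranges, so it is unlikely to scale to a huge file. The variable names (ranges, column, hit, res) are illustrative, and the strict inequalities mirror the join condition above.

```r
# Sample data from the question, stored as plain character columns.
input <- data.frame(
  CT1 = c("chr1:200-400", "chr1:800-970", "chr2:300-700"),
  CT2 = c("chr1:250-450", "chr2:200-500", "chr2:600-1000"),
  CT3 = c("chr1:400-800", "chr1:700-870", "chr2:700-1400"),
  stringsAsFactors = FALSE
)

# Long format: one element per range, plus the column it came from.
# as.character() also handles factor columns such as those in the dput.
ranges <- unlist(lapply(input, as.character), use.names = FALSE)
column <- rep(names(input), each = nrow(input))

# Split "chr:low-high" into its three components.
parts <- strsplit(ranges, "[:-]")
chr   <- vapply(parts, `[`, character(1), 1)
low   <- as.integer(vapply(parts, `[`, character(1), 2))
high  <- as.integer(vapply(parts, `[`, character(1), 3))

# hit[i, j] is TRUE when range i overlaps range j (every range overlaps itself).
hit <- outer(chr, chr, "==") & outer(low, high, "<") & outer(high, low, ">")

# For each range and each source column: does at least one overlapping range exist there?
res <- sapply(names(input), function(ct) {
  as.integer(rowSums(hit[, column == ct, drop = FALSE]) > 0)
})
rownames(res) <- ranges
res
```

Rows come out in column-by-column order (all of CT1, then CT2, then CT3), which matches the row order of the desired output in the question rather than the sorted order returned by dcast.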