根据范围重叠处理输入文件
Processing the input file based on range overlap
我有一个巨大的输入文件(其中的代表性示例如下所示 input
):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
我想通过遵循一些规则(如下所述)来处理它,以便我得到一个 output
像:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 0
chr2:300-700 1 1 0
chr1:250-450 1 1 0
chr2:200-500 1 1 0
chr2:600-1000 0 1 1
chr1:400-800 0 0 1
chr1:700-870 0 1 1
chr2:700-1400 0 1 1
规则:
获取每个索引(在这种情况下第一个是 chr1:200-400
),看看它是否与其他列中的值显着重叠。从显着性来看,我的意思是范围至少有 50% 的重叠。如果有,就在它所在的那一栏下面写上 1
,如果没有,就写上 0
.
现在我解释一下我是如何得到输出的 table。从我们的输入中,我们采用第一个索引输入[1,1],即chr1:200-400
。由于它存在于第 1 列中,我们将在其下方写上 1。现在我们将检查任何其他列中是否存在此范围或重叠范围。此值仅与第二列 (CT2
) 的第一个值 (chr1:250-450
) 重叠。现在我们检查这种重叠是否显着。我们知道范围是 200 (chr1:200-400),与第二列值(chr1:250-450)的重叠是 150 (250-400)。由于 150 的重叠大于原始范围 (200-400 = 200) 的一半(原始范围的 50% = 100)OR 重叠范围 (250-450 = 200)。我们认为它是重叠的,并在 CT2 列下方分配 1。由于这个范围不与CT3
中的任何值重叠,我们在CT3
下面写0。
同样对于输出的第 9 行。 chr2:700-1400
在 CT1
中不存在,所以在下面写 0。对于 CT2
,它与 chr2:600-1000
重叠。这里原来的范围是700(chr2:700-1400),一半是350,和CT2
的chr2:700-1000
重叠的是300(超出实际范围chr2:600-1000)。现在这个 300 的重叠不大于实际范围 700 的一半(CT3 的 chr2:700-1400),但它大于重叠范围 400 的一半(chr2:600-1000 of CT2)。因此,我们认为它是重叠的,并在 CT2 下面写上 1。由于这个范围在CT3中确实存在,所以我们在其下方也写上1。
这里是输入输出的dput:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
要做到这一点需要很多步骤,以及 data.table
包中的一些概念,最值得注意的是非等值连接。我在整个过程中都对代码进行了注释,如果您想要更深入的了解,建议 运行 逐步执行:
library(data.table)
input <- structure(list(CT1 = structure(1:3, .Label =
c("chr1:200-400", "chr1:800-970", "chr2:300-700"), class =
"factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 =
structure(1:3, .Label = c("chr1:400-800", "chr1:700-870",
"chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
output <- structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
CT2 = c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L,
0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class =
"data.frame", row.names = c("chr1:200-400", "chr1:800-970",
"chr2:300-700", "chr1:250-450", "chr2:200-500", "chr2:600-1000",
"chr1:400-800", "chr1:700-870", "chr2:700-1400"))
# Builds a data.table by breaking a string like "chr1:300-700" into
# three columns: chr, start, and end.
split_genomic_range <- function(str) {
chr <- gsub(":.*", "", str)
start <- gsub("-.*", "", gsub(".*:", "", str))
end <- gsub(".*-", "", str)
start <- as.numeric(start)
end <- as.numeric(end)
return(data.table(chr=chr, start=start, end=end))
}
# First break the input data.table into three new tables - we will need
# to perform non-equi joins of the index table (column CT1 in input) to
# the tables built from the other two columns.
ct1 <- split_genomic_range(input$CT1)
ct2 <- split_genomic_range(input$CT2)
ct3 <- split_genomic_range(input$CT3)
# Create an index table with all genomic ranges, then check for
# overlaps in each of the three tables created from the input
# columns:
index_table <- unique(rbind(ct1, ct2, ct3))
# Returns entries from the index_table if they overlap > 50% any
# entries in the lookup table or vice-versa
get_overlapping_ranges <- function(index_table, lookup_table) {
# This function does two non-equi joins. First, it checks whether
# any entries in the index_table have a 50% overlap with any
# entries in the lookup table. Second, it does the reverse, and
# checks whether any entries in the lookup_table have a 50% overlap
# with any entries in the index_table. This is required due to
# differing window sizes:
# e.g. 0-20 significantly overlaps 10-100, but 10-100 does not
# significantly overlap 0-20.
# We will need to create a "middle" column for each genomic range.
# We will need to create copies of each table to do this, otherwise
# they will end up with this new column as a side effect of the
# function call.
index_copy <- copy(index_table)
lookup_copy <- copy(lookup_table)
index_copy[, middle := start + (end - start) * 0.5]
lookup_copy[, middle := start + (end - start) * 0.5]
# In the index_copy we will also need to create dummy columns for
# the check. We need to do this so we can find the appropriate
# genomic window from the index table when we do the second
# non-equi join, otherwise the start and end columns will be
# clobbered.
index_copy[, start_index := start]
index_copy[, end_index := end]
# If the middle of a genomic range in the index table falls within
# a genomic range in the lookup table, then that genomic entry from
# the index table has a significant overlap with the corresponding
# in the lookup table
index_overlaps <- index_copy[lookup_table,
on=.(chr, middle >= start, middle <= end),
nomatch=0]
# Test the reverse: does any entry in the lookup table
# significantly overlap with any of the genomic windows in the
# index table?
lookup_overlaps <- index_copy[lookup_copy,
on=.(chr, start_index <= middle, end_index >= middle),
nomatch=0]
# Remove extra columns created by the non-equi join:
index_overlaps <- index_overlaps[,.(chr, start, end)]
lookup_overlaps <- lookup_overlaps[,.(chr, start, end)]
# Combine results and remove any duplicates that arise, e.g.
# because a genomic window in the index_table significantly
# overlaps with multiple genomic windows in the lookup table, or
# because the overlap is significant in both comparisons (i.e.
# where both windows are the same size).
overlaps <- unique(rbind(index_overlaps, lookup_overlaps))
return(overlaps)
}
ranges_in_ct1 <- get_overlapping_ranges(index_table, ct1)
ranges_in_ct2 <- get_overlapping_ranges(index_table, ct2)
ranges_in_ct3 <- get_overlapping_ranges(index_table, ct3)
# Combine results, indicating which column each genomic range was
# found to overlap:
overlaps <- rbind(
CT1=ranges_in_ct1, CT2=ranges_in_ct2, CT3=ranges_in_ct3,
idcol="input_column"
)
# Recombine the chr, start, and end columns to the original format:
overlaps[, genomic_window := paste0(chr, ":", start, "-", end)]
overlaps[, c("chr", "start", "end") := NULL]
# Convert to the wide format, so that each input column either has a
# 1 or 0 if the genomic window overlaps with 50% any other found in
# that column
overlaps <- dcast(overlaps, genomic_window ~ input_column,
fun.aggregate = length)
# Convert back to a data.frame:
overlaps <- as.data.frame(overlaps)
rownames(overlaps) <- overlaps$genomic_window
overlaps <- overlaps[,-1]
# Reorder so we can double check against the desired output:
overlaps <- overlaps[rownames(output),]
print(overlaps)
这将生成(几乎)与您提供的相同的输出:
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 0
chr2:300-700 1 1 0
chr1:250-450 1 1 0
chr2:200-500 1 1 0
chr2:600-1000 0 1 1
chr1:400-800 0 0 1
chr1:700-870 0 0 1
chr2:700-1400 0 1 1
唯一的区别是 chr1:700-870 在 CT2 列中有一个 0:这是因为它实际上不与 CT2 中的任何基因组 windows 重叠,唯一的其他 window 在 1 号染色体上是 chr1:250-450.
我有一个巨大的输入文件(其中的代表性示例如下所示 input
):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
我想通过遵循一些规则(如下所述)来处理它,以便我得到一个 output
像:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 0
chr2:300-700 1 1 0
chr1:250-450 1 1 0
chr2:200-500 1 1 0
chr2:600-1000 0 1 1
chr1:400-800 0 0 1
chr1:700-870 0 1 1
chr2:700-1400 0 1 1
规则:
获取每个索引(在这种情况下第一个是 chr1:200-400
),看看它是否与其他列中的值显着重叠。从显着性来看,我的意思是范围至少有 50% 的重叠。如果有,就在它所在的那一栏下面写上 1
,如果没有,就写上 0
.
现在我解释一下我是如何得到输出的 table。从我们的输入中,我们采用第一个索引输入[1,1],即chr1:200-400
。由于它存在于第 1 列中,我们将在其下方写上 1。现在我们将检查任何其他列中是否存在此范围或重叠范围。此值仅与第二列 (CT2
) 的第一个值 (chr1:250-450
) 重叠。现在我们检查这种重叠是否显着。我们知道范围是 200 (chr1:200-400),与第二列值(chr1:250-450)的重叠是 150 (250-400)。由于 150 的重叠大于原始范围 (200-400 = 200) 的一半(原始范围的 50% = 100)OR 重叠范围 (250-450 = 200)。我们认为它是重叠的,并在 CT2 列下方分配 1。由于这个范围不与CT3
中的任何值重叠,我们在CT3
下面写0。
同样对于输出的第 9 行。 chr2:700-1400
在 CT1
中不存在,所以在下面写 0。对于 CT2
,它与 chr2:600-1000
重叠。这里原来的范围是700(chr2:700-1400),一半是350,和CT2
的chr2:700-1000
重叠的是300(超出实际范围chr2:600-1000)。现在这个 300 的重叠不大于实际范围 700 的一半(CT3 的 chr2:700-1400),但它大于重叠范围 400 的一半(chr2:600-1000 of CT2)。因此,我们认为它是重叠的,并在 CT2 下面写上 1。由于这个范围在CT3中确实存在,所以我们在其下方也写上1。
这里是输入输出的dput:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
要做到这一点需要很多步骤,以及 data.table
包中的一些概念,最值得注意的是非等值连接。我在整个过程中都对代码进行了注释,如果您想要更深入的了解,建议 运行 逐步执行:
library(data.table)
input <- structure(list(CT1 = structure(1:3, .Label =
c("chr1:200-400", "chr1:800-970", "chr2:300-700"), class =
"factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 =
structure(1:3, .Label = c("chr1:400-800", "chr1:700-870",
"chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
output <- structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
CT2 = c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L,
0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class =
"data.frame", row.names = c("chr1:200-400", "chr1:800-970",
"chr2:300-700", "chr1:250-450", "chr2:200-500", "chr2:600-1000",
"chr1:400-800", "chr1:700-870", "chr2:700-1400"))
# Builds a data.table by breaking a string like "chr1:300-700" into
# three columns: chr, start, and end.
split_genomic_range <- function(str) {
chr <- gsub(":.*", "", str)
start <- gsub("-.*", "", gsub(".*:", "", str))
end <- gsub(".*-", "", str)
start <- as.numeric(start)
end <- as.numeric(end)
return(data.table(chr=chr, start=start, end=end))
}
# First break the input data.table into three new tables - we will need
# to perform non-equi joins of the index table (column CT1 in input) to
# the tables built from the other two columns.
ct1 <- split_genomic_range(input$CT1)
ct2 <- split_genomic_range(input$CT2)
ct3 <- split_genomic_range(input$CT3)
# Create an index table with all genomic ranges, then check for
# overlaps in each of the three tables created from the input
# columns:
index_table <- unique(rbind(ct1, ct2, ct3))
# Returns entries from the index_table if they overlap > 50% any
# entries in the lookup table or vice-versa
get_overlapping_ranges <- function(index_table, lookup_table) {
# This function does two non-equi joins. First, it checks whether
# any entries in the index_table have a 50% overlap with any
# entries in the lookup table. Second, it does the reverse, and
# checks whether any entries in the lookup_table have a 50% overlap
# with any entries in the index_table. This is required due to
# differing window sizes:
# e.g. 0-20 significantly overlaps 10-100, but 10-100 does not
# significantly overlap 0-20.
# We will need to create a "middle" column for each genomic range.
# We will need to create copies of each table to do this, otherwise
# they will end up with this new column as a side effect of the
# function call.
index_copy <- copy(index_table)
lookup_copy <- copy(lookup_table)
index_copy[, middle := start + (end - start) * 0.5]
lookup_copy[, middle := start + (end - start) * 0.5]
# In the index_copy we will also need to create dummy columns for
# the check. We need to do this so we can find the appropriate
# genomic window from the index table when we do the second
# non-equi join, otherwise the start and end columns will be
# clobbered.
index_copy[, start_index := start]
index_copy[, end_index := end]
# If the middle of a genomic range in the index table falls within
# a genomic range in the lookup table, then that genomic entry from
# the index table has a significant overlap with the corresponding
# in the lookup table
index_overlaps <- index_copy[lookup_table,
on=.(chr, middle >= start, middle <= end),
nomatch=0]
# Test the reverse: does any entry in the lookup table
# significantly overlap with any of the genomic windows in the
# index table?
lookup_overlaps <- index_copy[lookup_copy,
on=.(chr, start_index <= middle, end_index >= middle),
nomatch=0]
# Remove extra columns created by the non-equi join:
index_overlaps <- index_overlaps[,.(chr, start, end)]
lookup_overlaps <- lookup_overlaps[,.(chr, start, end)]
# Combine results and remove any duplicates that arise, e.g.
# because a genomic window in the index_table significantly
# overlaps with multiple genomic windows in the lookup table, or
# because the overlap is significant in both comparisons (i.e.
# where both windows are the same size).
overlaps <- unique(rbind(index_overlaps, lookup_overlaps))
return(overlaps)
}
ranges_in_ct1 <- get_overlapping_ranges(index_table, ct1)
ranges_in_ct2 <- get_overlapping_ranges(index_table, ct2)
ranges_in_ct3 <- get_overlapping_ranges(index_table, ct3)
# Combine results, indicating which column each genomic range was
# found to overlap:
overlaps <- rbind(
CT1=ranges_in_ct1, CT2=ranges_in_ct2, CT3=ranges_in_ct3,
idcol="input_column"
)
# Recombine the chr, start, and end columns to the original format:
overlaps[, genomic_window := paste0(chr, ":", start, "-", end)]
overlaps[, c("chr", "start", "end") := NULL]
# Convert to the wide format, so that each input column either has a
# 1 or 0 if the genomic window overlaps with 50% any other found in
# that column
overlaps <- dcast(overlaps, genomic_window ~ input_column,
fun.aggregate = length)
# Convert back to a data.frame:
overlaps <- as.data.frame(overlaps)
rownames(overlaps) <- overlaps$genomic_window
overlaps <- overlaps[,-1]
# Reorder so we can double check against the desired output:
overlaps <- overlaps[rownames(output),]
print(overlaps)
这将生成(几乎)与您提供的相同的输出:
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 0
chr2:300-700 1 1 0
chr1:250-450 1 1 0
chr2:200-500 1 1 0
chr2:600-1000 0 1 1
chr1:400-800 0 0 1
chr1:700-870 0 0 1
chr2:700-1400 0 1 1
唯一的区别是 chr1:700-870 在 CT2 列中有一个 0:这是因为它实际上不与 CT2 中的任何基因组 windows 重叠,唯一的其他 window 在 1 号染色体上是 chr1:250-450.