根据范围重叠处理输入文件

Processing the input file based on range overlap

我有一个巨大的输入文件(其中的代表性示例如下所示 input):

> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400

我想通过遵循一些规则(如下所述)来处理它,以便我得到一个 output 像:

 > output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   0
chr2:300-700    1   1   0
chr1:250-450    1   1   0
chr2:200-500    1   1   0
chr2:600-1000   0   1   1
chr1:400-800    0   0   1
chr1:700-870    0   1   1
chr2:700-1400   0   1   1

规则: 获取每个索引(在这种情况下第一个是 chr1:200-400),看看它是否与其他列中的值显着重叠。从显着性来看,我的意思是范围至少有 50% 的重叠。如果有,就在它所在的那一栏下面写上 1,如果没有,就写上 0.

现在我解释一下我是如何得到输出的 table。从我们的输入中,我们采用第一个索引输入[1,1],即chr1:200-400。由于它存在于第 1 列中,我们将在其下方写上 1。现在我们将检查任何其他列中是否存在此范围或重叠范围。此值仅与第二列 (CT2) 的第一个值 (chr1:250-450) 重叠。现在我们检查这种重叠是否显着。我们知道范围是 200 (chr1:200-400),与第二列值(chr1:250-450)的重叠是 150 (250-400)。由于 150 的重叠大于原始范围 (200-400 = 200) 的一半(原始范围的 50% = 100)OR 重叠范围 (250-450 = 200)。我们认为它是重叠的,并在 CT2 列下方分配 1。由于这个范围不与CT3中的任何值重叠,我们在CT3下面写0。 同样对于输出的第 9 行。 chr2:700-1400CT1 中不存在,所以在下面写 0。对于 CT2,它与 chr2:600-1000 重叠。这里原来的范围是700(chr2:700-1400),一半是350,和CT2chr2:700-1000重叠的是300(超出实际范围chr2:600-1000)。现在这个 300 的重叠不大于实际范围 700 的一半(CT3 的 chr2:700-1400),但它大于重叠范围 400 的一半(chr2:600-1000 of CT2)。因此,我们认为它是重叠的,并在 CT2 下面写上 1。由于这个范围在CT3中确实存在,所以我们在其下方也写上1。

这里是输入输出的dput:

> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800", 
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500", 
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))

要做到这一点需要很多步骤,以及 data.table 包中的一些概念,最值得注意的是非等值连接。我在整个过程中都对代码进行了注释,如果您想要更深入的了解,建议 运行 逐步执行:

library(data.table)

input <- structure(list(CT1 = structure(1:3, .Label = 
  c("chr1:200-400", "chr1:800-970", "chr2:300-700"), class = 
  "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
  "chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = 
  structure(1:3, .Label = c("chr1:400-800", "chr1:700-870", 
  "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
  "CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))

output <- structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
  CT2 = c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 
  0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = 
  "data.frame", row.names = c("chr1:200-400", "chr1:800-970", 
  "chr2:300-700", "chr1:250-450", "chr2:200-500", "chr2:600-1000",
  "chr1:400-800", "chr1:700-870", "chr2:700-1400"))

# Builds a data.table by breaking a string like "chr1:300-700" into 
# three columns: chr, start, and end.
split_genomic_range <- function(str) {
  chr <- gsub(":.*", "", str)
  start <- gsub("-.*", "", gsub(".*:", "", str))
  end <- gsub(".*-", "", str)

  start <- as.numeric(start)
  end <- as.numeric(end)

  return(data.table(chr=chr, start=start, end=end))
}

# First break the input data.table into three new tables - we will need
# to perform non-equi joins of the index table (column CT1 in input) to
# the tables built from the other two columns.
ct1 <- split_genomic_range(input$CT1)
ct2 <- split_genomic_range(input$CT2)
ct3 <- split_genomic_range(input$CT3)

# Create an index table with all genomic ranges, then check for 
# overlaps in each of the three tables created from the input
# columns:
index_table <- unique(rbind(ct1, ct2, ct3))

# Returns entries from the index_table if they overlap > 50% any 
# entries in the lookup table or vice-versa
get_overlapping_ranges <- function(index_table, lookup_table) {
  # This function does two non-equi joins. First, it checks whether 
  # any entries in the index_table have a 50% overlap with any 
  # entries in  the lookup table. Second, it does the reverse, and  
  # checks whether any entries in the lookup_table have a 50% overlap 
  # with any entries in the index_table. This is required due to 
  # differing window sizes:
  # e.g. 0-20 significantly overlaps 10-100, but 10-100 does not 
  # significantly overlap 0-20.

  # We will need to create a "middle" column for each genomic range.
  # We will need to create copies of each table to do this, otherwise
  # they will end up with this new column as a side effect of the 
  # function call.
  index_copy <- copy(index_table)
  lookup_copy <- copy(lookup_table)

  index_copy[, middle := start + (end - start) * 0.5]
  lookup_copy[, middle := start + (end - start) * 0.5]

  # In the index_copy we will also need to create dummy columns for
  # the check. We need to do this so we can find the appropriate 
  # genomic window from the index table when we do the second  
  # non-equi join, otherwise the start and end columns will be 
  # clobbered. 
  index_copy[, start_index := start]
  index_copy[, end_index := end]

  # If the middle of a genomic range in the index table falls within 
  # a genomic range in the lookup table, then that genomic entry from 
  # the index table has a significant overlap with the corresponding 
  # in the lookup table
  index_overlaps <- index_copy[lookup_table, 
    on=.(chr, middle >= start, middle <= end),
    nomatch=0]

  # Test the reverse: does any entry in the lookup table 
  # significantly  overlap with any of the genomic windows in the 
  # index table?
  lookup_overlaps <- index_copy[lookup_copy,
    on=.(chr, start_index <= middle, end_index >= middle),
    nomatch=0]

  # Remove extra columns created by the non-equi join:
  index_overlaps <- index_overlaps[,.(chr, start, end)]
  lookup_overlaps <- lookup_overlaps[,.(chr, start, end)]

  # Combine results and remove any duplicates that arise, e.g. 
  # because a genomic window in the index_table significantly 
  # overlaps with multiple genomic windows in the lookup table, or 
  # because the overlap is significant in both comparisons (i.e. 
  # where both windows are the same size).
  overlaps <- unique(rbind(index_overlaps, lookup_overlaps))

  return(overlaps)
}

ranges_in_ct1 <- get_overlapping_ranges(index_table, ct1)
ranges_in_ct2 <- get_overlapping_ranges(index_table, ct2)
ranges_in_ct3 <- get_overlapping_ranges(index_table, ct3)

# Combine results, indicating which column each genomic range was 
# found to overlap:
overlaps <- rbind(
  CT1=ranges_in_ct1, CT2=ranges_in_ct2, CT3=ranges_in_ct3,
  idcol="input_column"
) 

# Recombine the chr, start, and end columns to the original format:
overlaps[, genomic_window := paste0(chr, ":", start, "-", end)]
overlaps[, c("chr", "start", "end") := NULL]

# Convert to the wide format, so that each input column either has a  
# 1 or 0 if the genomic window overlaps with 50% any other found in 
# that column
overlaps <- dcast(overlaps, genomic_window ~ input_column, 
                  fun.aggregate = length)

# Convert back to a data.frame:
overlaps <- as.data.frame(overlaps)
rownames(overlaps) <- overlaps$genomic_window
overlaps <- overlaps[,-1]

# Reorder so we can double check against the desired output:
overlaps <- overlaps[rownames(output),]
print(overlaps)

这将生成(几乎)与您提供的相同的输出:

              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   0
chr2:300-700    1   1   0
chr1:250-450    1   1   0
chr2:200-500    1   1   0
chr2:600-1000   0   1   1
chr1:400-800    0   0   1
chr1:700-870    0   0   1
chr2:700-1400   0   1   1

唯一的区别是 chr1:700-870 在 CT2 列中有一个 0:这是因为它实际上不与 CT2 中的任何基因组 windows 重叠,唯一的其他 window 在 1 号染色体上是 chr1:250-450.