Select rows representing a range between two positions, keeping only intervals that contain at least one position from another table
I need to keep rows from the "ref" table only when the interval (start/end) contains at least one "position" (start) from the "map" table:
Mock "ref" table:
ref<-"chr start end
chr1 1 10
chr1 20 30
chr1 30 40
chr1 40 50
chr2 20 30
chr2 40 50
chr2 80 90"
ref<-read.table(text=ref,header=T)
Mock "map" table:
map<-"chr start
chr1 1
chr1 3
chr1 5
chr1 31
chr1 32
chr2 1
chr2 2
chr2 89"
map<-read.table(text=map,header=T)
I need a final table like this (only the intervals that contain at least one of the values in the "map" table):
final<-"chr start end
chr1 1 10
chr1 30 40
chr2 80 90"
final<-read.table(text=final,header=T)
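Note that I also take the chromosome into account. Also, what matters is the whole interval between the "start" and "end" values in "ref", not just the "start" and "end" values themselves.
To handle the chromosome issue, I treat chr+start and chr+end as "tag" and "tag1", respectively.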
ref$tag <- paste0(ref$chr, "-", ref$start)
ref$tag1 <- paste0(ref$chr, "-", ref$end)
map$tag <- paste0(map$chr, "-", map$start)
ref[ref$start %in% map$start | ref$end %in% map$start, ]
In more detail:
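# Note: this only tests whether an interval's start or end value equals a map position;
# it ignores chr and any position that falls strictly inside the interval.
rows_to_keep <- ref$start %in% map$start | ref$end %in% map$start
rows_to_keep
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
ref[rows_to_keep, c("chr", "start", "end")]
#    chr start end
# 1 chr1     1  10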
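For what it's worth, here is a minimal base R sketch that tests containment (same chromosome, position between start and end, inclusive) instead of endpoint equality. It re-creates the mock tables so it runs on its own; the names `keep` and `final_base` are mine, not from the question, and inclusive boundaries are an assumption based on the expected output.

```r
# Mock tables from the question, re-created so the sketch is self-contained
ref <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "chr start end
chr1 1 10
chr1 20 30
chr1 30 40
chr1 40 50
chr2 20 30
chr2 40 50
chr2 80 90")
map <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "chr start
chr1 1
chr1 3
chr1 5
chr1 31
chr1 32
chr2 1
chr2 2
chr2 89")

# For each ref row: is there any map position on the same chromosome
# that lies within [start, end] (boundaries assumed inclusive)?
keep <- vapply(seq_len(nrow(ref)), function(i) {
  any(map$chr == ref$chr[i] &
      map$start >= ref$start[i] &
      map$start <= ref$end[i])
}, logical(1))

final_base <- ref[keep, ]
final_base
#    chr start end
# 1 chr1     1  10
# 3 chr1    30  40
# 7 chr2    80  90
```

This loops over every ref row and scans all of map each time, so it is fine for small tables but will get slow on genome-scale data, where an interval-aware package is a better fit.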
According to the thread “Finding overlapping ranges between two interval data”:
"In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals"
So, here you go:
library("GenomicRanges")
library("data.table")
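# Build GRanges objects; each map position becomes a width-1 range (start == end)
gr1 <- with(ref, GRanges(Rle(factor(chr, levels = c("chr1", "chr2"))), IRanges(start, end)))
gr2 <- with(map, GRanges(Rle(factor(chr, levels = c("chr1", "chr2"))), IRanges(start, start)))
# Keep only the ref intervals that overlap at least one map position
olaps <- subsetByOverlaps(gr1, gr2)
olaps <- as.data.frame(olaps)
# as.data.frame() returns seqnames, start, end, width, strand; rename, then keep the first three
col_headings <- c("chr", "start", "end", "width", "strand")
names(olaps) <- col_headings
final <- subset(olaps, select = c("chr", "start", "end"))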
> final
chr start end
1 chr1 1 10
2 chr1 30 40
3 chr2 80 90
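A possible variation on the same idea, assuming the Bioconductor package GenomicRanges is installed: `countOverlaps()` returns one count per ref interval, so you can index `ref` directly and keep its original columns and row names without the `as.data.frame()` round trip. The names `ref_gr`, `map_gr`, `n_hits` and `final_counts` are mine; treat this as a sketch rather than the answer's own method.

```r
library(GenomicRanges)

# Mock tables from the question, re-created so the sketch is self-contained
ref <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "chr start end
chr1 1 10
chr1 20 30
chr1 30 40
chr1 40 50
chr2 20 30
chr2 40 50
chr2 80 90")
map <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "chr start
chr1 1
chr1 3
chr1 5
chr1 31
chr1 32
chr2 1
chr2 2
chr2 89")

# Intervals from ref; each map position becomes a width-1 range
ref_gr <- GRanges(seqnames = ref$chr, ranges = IRanges(start = ref$start, end = ref$end))
map_gr <- GRanges(seqnames = map$chr, ranges = IRanges(start = map$start, end = map$start))

# Number of map positions falling inside each ref interval (same chromosome only)
n_hits <- countOverlaps(ref_gr, map_gr)

# Keep the ref rows hit at least once
final_counts <- ref[n_hits > 0, ]
```

On the mock data this should select the same three intervals as `subsetByOverlaps()`, and `n_hits` additionally tells you how many positions each interval contains.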