如何处理两个床文件以并行查找重叠区域?
How to process two bed files for finding overlapped regions in parallel?
我想处理多个床文件以查找重叠区域。我将我的数据集读取为数据框,以及如何有效地并行扫描两个数据集以检测重叠区域出现的位置。我的方法是每次我将数据框对象的每个单元格的峰值区域作为查询,在 intervaltree 中获取另一个数据框的所有行的峰值区域,然后搜索重叠区域。我很困惑如何在 R 中实现这一点。请帮助处理生物信息学中的床格式文件。如果有人指出我如何做到这一点,我将不胜感激......
这是我想要实现的简单示例:
[1] chr1 [10171, 10226] * | MACS_peak_1 7.12
[2] chr1 [32698, 33079] * | MACS_peak_2 13.92
[3] chr1 [34757, 34794] * | MACS_peak_3 6.08
[4] chr1 [37786, 37833] * | MACS_peak_4 2.44
[5] chr1 [38449, 38484] * | MACS_peak_5 3.61
[6] chr1 [38584, 38838] * | MACS_peak_6 4.12
..
..
[] chrX [155191467, 155191508] * | MACS_peak_77948 3.80
[] chrX [155192786, 155192821] * | MACS_peak_77949 3.71
[] chrX [155206352, 155206433] * | MACS_peak_77950 3.81
[] chrX [155238796, 155238831] * | MACS_peak_77951 3.81
[n-1] chrX [155246563, 155246616] * | MACS_peak_77952 2.44
[n] chrX [155258442, 155258491] * | MACS_peak_77953 5.08
#step 1: read two bed files in R:
bed_1 <- as(import.bed(bedFile_1), "GRanges")
bed_2 <- as(import.bed(bedFile_2), "GRanges")
bed_3 <- as(import.bed(bedFile_3), "GRanges")
step 2: extract first row of the bed_1 (only take one specific interval as query). each row is considered as one specific genomic interval
peak <- bed_1[1] # only take one row once
bed_2.intvl <- GenomicRanges::GIntervalTree(bed_2)
#step 3: find overlapped regions:
overlap <- GenomicRanges::findOverlaps(peak, bed_2.intvl)
# step 4: go back to original bed_2 and look at which interval were overlapped with peak that comes from bed_1, what's the significance of each of these interval that comes from bed_2.
# step 5: then iterate next interval from bed_1 to repeat above process
使用Bioconductor, import the bed files using rtracklayer
library(rtracklayer)
bed1 = import("foo.bed")
bed2 = import("bar.bed")
然后查询'overlaps';不太清楚这对你意味着什么,也许
bed1OverlappingBed2 = bed1[bed1 %over% bed2]
更灵活,findOverlaps(bed1, bed2)
。 Follow-up 有关此方法的问题应提交至 Bioconductor support forum。
假设我们输入了 query
和 subject
。找到所有匹配项
hits <- findOverlaps(query, subject)
这个 returns 看起来像 two-column 矩阵的东西。第一列是查询的索引,第二列是主题的索引。如果query的第一个元素与subject的多个元素重叠,则query hits列下会出现1
多次出现的几行,与该范围重叠的subject的索引配对。采用原始范围集和 'expand' 它们来匹配匹配项,例如
query[queryHits(hits)]
并找到它们重叠的区域的交集
pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
这为您提供了 element-wise 重叠,但没有进行迭代。
举个小例子,这里有一些 'chr1' 上的范围,表示为 GRanges
对象(床文件也表示为 GRanges,但有额外的 mcols()
信息来自床上文件)。
query = GRanges("chr1", IRanges(c(10, 20, 30), width=5))
subject = GRanges("chr1", IRanges(c(10, 14), width=9))
他们看起来像
> query
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [10, 14] *
[2] chr1 [20, 24] *
[3] chr1 [30, 34] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> subject
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [10, 18] *
[2] chr1 [14, 22] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
点击次数如下:
> hits = findOverlaps(query, subject)
> hits
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
[3] 2 2
-------
queryLength: 3
subjectLength: 2
可以看到query的第一个range和subject的range 1和2重叠了。这是相交范围
> pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | hit
<Rle> <IRanges> <Rle> | <logical>
[1] chr1 [10, 14] * | TRUE
[2] chr1 [14, 14] * | TRUE
[3] chr1 [20, 22] * | TRUE
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
所以query 1和subject 1在位置10到14重叠,query 1和subject 2在位置14重叠,query 2和subject 2在位置20到22重叠。(Bioconductor使用基于1的闭区间; UCSC 使用基于 0 的 half-open 间隔;rtracklayer::import.bed()
在导入文件时做正确的事情。
我想处理多个床文件以查找重叠区域。我将我的数据集读取为数据框,以及如何有效地并行扫描两个数据集以检测重叠区域出现的位置。我的方法是每次我将数据框对象的每个单元格的峰值区域作为查询,在 intervaltree 中获取另一个数据框的所有行的峰值区域,然后搜索重叠区域。我很困惑如何在 R 中实现这一点。请帮助处理生物信息学中的床格式文件。如果有人指出我如何做到这一点,我将不胜感激......
这是我想要实现的简单示例:
[1] chr1 [10171, 10226] * | MACS_peak_1 7.12
[2] chr1 [32698, 33079] * | MACS_peak_2 13.92
[3] chr1 [34757, 34794] * | MACS_peak_3 6.08
[4] chr1 [37786, 37833] * | MACS_peak_4 2.44
[5] chr1 [38449, 38484] * | MACS_peak_5 3.61
[6] chr1 [38584, 38838] * | MACS_peak_6 4.12
..
..
[] chrX [155191467, 155191508] * | MACS_peak_77948 3.80
[] chrX [155192786, 155192821] * | MACS_peak_77949 3.71
[] chrX [155206352, 155206433] * | MACS_peak_77950 3.81
[] chrX [155238796, 155238831] * | MACS_peak_77951 3.81
[n-1] chrX [155246563, 155246616] * | MACS_peak_77952 2.44
[n] chrX [155258442, 155258491] * | MACS_peak_77953 5.08
#step 1: read two bed files in R:
bed_1 <- as(import.bed(bedFile_1), "GRanges")
bed_2 <- as(import.bed(bedFile_2), "GRanges")
bed_3 <- as(import.bed(bedFile_3), "GRanges")
step 2: extract first row of the bed_1 (only take one specific interval as query). each row is considered as one specific genomic interval
peak <- bed_1[1] # only take one row once
bed_2.intvl <- GenomicRanges::GIntervalTree(bed_2)
#step 3: find overlapped regions:
overlap <- GenomicRanges::findOverlaps(peak, bed_2.intvl)
# step 4: go back to original bed_2 and look at which interval were overlapped with peak that comes from bed_1, what's the significance of each of these interval that comes from bed_2.
# step 5: then iterate next interval from bed_1 to repeat above process
使用Bioconductor, import the bed files using rtracklayer
library(rtracklayer)
bed1 = import("foo.bed")
bed2 = import("bar.bed")
然后查询'overlaps';不太清楚这对你意味着什么,也许
bed1OverlappingBed2 = bed1[bed1 %over% bed2]
更灵活,findOverlaps(bed1, bed2)
。 Follow-up 有关此方法的问题应提交至 Bioconductor support forum。
假设我们输入了 query
和 subject
。找到所有匹配项
hits <- findOverlaps(query, subject)
这个 returns 看起来像 two-column 矩阵的东西。第一列是查询的索引,第二列是主题的索引。如果query的第一个元素与subject的多个元素重叠,则query hits列下会出现1
多次出现的几行,与该范围重叠的subject的索引配对。采用原始范围集和 'expand' 它们来匹配匹配项,例如
query[queryHits(hits)]
并找到它们重叠的区域的交集
pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
这为您提供了 element-wise 重叠,但没有进行迭代。
举个小例子,这里有一些 'chr1' 上的范围,表示为 GRanges
对象(床文件也表示为 GRanges,但有额外的 mcols()
信息来自床上文件)。
query = GRanges("chr1", IRanges(c(10, 20, 30), width=5))
subject = GRanges("chr1", IRanges(c(10, 14), width=9))
他们看起来像
> query
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [10, 14] *
[2] chr1 [20, 24] *
[3] chr1 [30, 34] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> subject
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [10, 18] *
[2] chr1 [14, 22] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
点击次数如下:
> hits = findOverlaps(query, subject)
> hits
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
[3] 2 2
-------
queryLength: 3
subjectLength: 2
可以看到query的第一个range和subject的range 1和2重叠了。这是相交范围
> pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | hit
<Rle> <IRanges> <Rle> | <logical>
[1] chr1 [10, 14] * | TRUE
[2] chr1 [14, 14] * | TRUE
[3] chr1 [20, 22] * | TRUE
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
所以query 1和subject 1在位置10到14重叠,query 1和subject 2在位置14重叠,query 2和subject 2在位置20到22重叠。(Bioconductor使用基于1的闭区间; UCSC 使用基于 0 的 half-open 间隔;rtracklayer::import.bed()
在导入文件时做正确的事情。