查找重叠区域并提取各自的值
Find overlapping regions and extract respective value
如何找到重叠坐标并提取重叠区域的相应 seg.mean 值?
data1
Rl pValue chr start end CNA
2 2.594433 6 129740000 129780000 gain
2 3.941399 6 130080000 130380000 gain
1 1.992114 10 80900000 81100000 gain
1 7.175750 16 44780000 44920000 gain
数据2
ID chrom loc.start loc.end num.mark seg.mean
8410 6 129750000 129760000 8430 0.0039
8410 10 80907000 81000000 5 -1.7738
8410 16 44790000 44910000 12 0.0110
数据输出
Rl pValue chr start end CNA seg.mean
2 2.594433 6 129750000 129760000 gain 0.0039
1 1.992114 10 80907000 81000000 gain -1.7738
1 7.175750 16 44790000 44910000 gain 0.0110
正如@Roland 正确建议的那样,这是一个可能的 data.table::foverlaps
解决方案
library(data.table)
setDT(data1) ; setDT(data2) # Convert data sets to data.table objects
setnames(data2, c("loc.start", "loc.end"), c("start", "end")) # Rename columns so they will match in both sets
setkey(data2, start, end) # key the smaller data so foverlaps will work
foverlaps(data1, data2, nomatch = 0L)[, 1:5 := NULL][] # run foverlaps and remove the unnecessary columns
# seg.mean Rl pValue chr i.start i.end CNA
# 1: 0.0039 2 2.594433 6 129740000 129780000 gain
# 2: -1.7738 1 1.992114 10 80900000 81100000 gain
# 3: 0.0110 1 7.175750 16 44780000 44920000 gain
或者
indx <- foverlaps(data1, data2, nomatch = 0L, which = TRUE) # run foverlaps in order to find indexes using `which`
data1[indx$xid][, seg.mean := data2[indx$yid]$seg.mean][] # update matches
# Rl pValue chr start end CNA seg.mean
# 1: 2 2.594433 6 129740000 129780000 gain 0.0039
# 2: 1 1.992114 10 80900000 81100000 gain -1.7738
# 3: 1 7.175750 16 44780000 44920000 gain 0.0110
当我们处理基因组数据时,将数据保存为 Granges 对象更容易,然后我们可以使用 - GenomicRanges
包中的 - mergeByOverlaps(g1,g2)
,请参见下面的示例:
library("GenomicRanges")
#data
x1 <- read.table(text="Rl pValue chr start end CNA
2 2.594433 6 129740000 129780000 gain
2 3.941399 6 130080000 130380000 gain
1 1.992114 10 80900000 81100000 gain
1 7.175750 16 44780000 44920000 gain",header=TRUE)
x2 <- read.table(text="ID chrom loc.start loc.end num.mark seg.mean
8410 6 129750000 129760000 8430 0.0039
8410 10 80907000 81000000 5 -1.7738
8410 16 44790000 44910000 12 0.0110",header=TRUE)
g1 <- GRanges(seqnames=paste0("chr",x1$chr),
IRanges(start=x1$start,
end=x1$end),
CNA=x1$CNA,
Rl=x1$Rl)
g2 <- GRanges(seqnames=paste0("chr",x2$chrom),
IRanges(start=x2$loc.start,
end=x2$loc.end),
ID=x2$ID,
num.mark=x2$num.mark,
seq.mean=x2$seg.mean)
mergeByOverlaps(g1,g2)
# DataFrame with 3 rows and 7 columns
# g1 CNA Rl g2 ID num.mark seq.mean
# <GRanges> <factor> <integer> <GRanges> <integer> <integer> <numeric>
# 1 chr6:*:[129740000, 129780000] gain 2 chr6:*:[129750000, 129760000] 8410 8430 0.0039
# 2 chr10:*:[ 80900000, 81100000] gain 1 chr10:*:[ 80907000, 81000000] 8410 5 -1.7738
# 3 chr16:*:[ 44780000, 44920000] gain 1 chr16:*:[ 44790000, 44910000] 8410 12 0.0110
编辑: 添加了 sessionInfo()
输出:
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicRanges_1.20.3 GenomeInfoDb_1.4.0 IRanges_2.2.1 S4Vectors_0.6.0 BiocGenerics_0.14.0
[6] BiocInstaller_1.18.1
loaded via a namespace (and not attached):
[1] XVector_0.8.0 tools_3.2.0
如何找到重叠坐标并提取重叠区域的相应 seg.mean 值?
data1
Rl pValue chr start end CNA
2 2.594433 6 129740000 129780000 gain
2 3.941399 6 130080000 130380000 gain
1 1.992114 10 80900000 81100000 gain
1 7.175750 16 44780000 44920000 gain
数据2
ID chrom loc.start loc.end num.mark seg.mean
8410 6 129750000 129760000 8430 0.0039
8410 10 80907000 81000000 5 -1.7738
8410 16 44790000 44910000 12 0.0110
数据输出
Rl pValue chr start end CNA seg.mean
2 2.594433 6 129750000 129760000 gain 0.0039
1 1.992114 10 80907000 81000000 gain -1.7738
1 7.175750 16 44790000 44910000 gain 0.0110
正如@Roland 正确建议的那样,这是一个可能的 data.table::foverlaps
解决方案
library(data.table)
setDT(data1) ; setDT(data2) # Convert data sets to data.table objects
setnames(data2, c("loc.start", "loc.end"), c("start", "end")) # Rename columns so they will match in both sets
setkey(data2, start, end) # key the smaller data so foverlaps will work
foverlaps(data1, data2, nomatch = 0L)[, 1:5 := NULL][] # run foverlaps and remove the unnecessary columns
# seg.mean Rl pValue chr i.start i.end CNA
# 1: 0.0039 2 2.594433 6 129740000 129780000 gain
# 2: -1.7738 1 1.992114 10 80900000 81100000 gain
# 3: 0.0110 1 7.175750 16 44780000 44920000 gain
或者
indx <- foverlaps(data1, data2, nomatch = 0L, which = TRUE) # run foverlaps in order to find indexes using `which`
data1[indx$xid][, seg.mean := data2[indx$yid]$seg.mean][] # update matches
# Rl pValue chr start end CNA seg.mean
# 1: 2 2.594433 6 129740000 129780000 gain 0.0039
# 2: 1 1.992114 10 80900000 81100000 gain -1.7738
# 3: 1 7.175750 16 44780000 44920000 gain 0.0110
当我们处理基因组数据时,将数据保存为 Granges 对象更容易,然后我们可以使用 - GenomicRanges
包中的 - mergeByOverlaps(g1,g2)
,请参见下面的示例:
library("GenomicRanges")
#data
x1 <- read.table(text="Rl pValue chr start end CNA
2 2.594433 6 129740000 129780000 gain
2 3.941399 6 130080000 130380000 gain
1 1.992114 10 80900000 81100000 gain
1 7.175750 16 44780000 44920000 gain",header=TRUE)
x2 <- read.table(text="ID chrom loc.start loc.end num.mark seg.mean
8410 6 129750000 129760000 8430 0.0039
8410 10 80907000 81000000 5 -1.7738
8410 16 44790000 44910000 12 0.0110",header=TRUE)
g1 <- GRanges(seqnames=paste0("chr",x1$chr),
IRanges(start=x1$start,
end=x1$end),
CNA=x1$CNA,
Rl=x1$Rl)
g2 <- GRanges(seqnames=paste0("chr",x2$chrom),
IRanges(start=x2$loc.start,
end=x2$loc.end),
ID=x2$ID,
num.mark=x2$num.mark,
seq.mean=x2$seg.mean)
mergeByOverlaps(g1,g2)
# DataFrame with 3 rows and 7 columns
# g1 CNA Rl g2 ID num.mark seq.mean
# <GRanges> <factor> <integer> <GRanges> <integer> <integer> <numeric>
# 1 chr6:*:[129740000, 129780000] gain 2 chr6:*:[129750000, 129760000] 8410 8430 0.0039
# 2 chr10:*:[ 80900000, 81100000] gain 1 chr10:*:[ 80907000, 81000000] 8410 5 -1.7738
# 3 chr16:*:[ 44780000, 44920000] gain 1 chr16:*:[ 44790000, 44910000] 8410 12 0.0110
编辑: 添加了 sessionInfo()
输出:
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicRanges_1.20.3 GenomeInfoDb_1.4.0 IRanges_2.2.1 S4Vectors_0.6.0 BiocGenerics_0.14.0
[6] BiocInstaller_1.18.1
loaded via a namespace (and not attached):
[1] XVector_0.8.0 tools_3.2.0