How to compute percentage of overlap in R
I am trying to compute the percentage of overlap between two datasets of genomic coordinates, subject to certain conditions.
seg2
ID chrom loc.start loc.end num.mark seg.mean
AB 1 3010000 173490000 8430 0.0039
AB 1 173510000 173590000 5 -17.738
AB 1 173610000 173830000 12 0.011
AB 1 173850000 173970000 6 -16.121
AB 2 3090000 181990000 8434 0.011
BB 12 3090000 68990000 2950 -0.2022
BB 12 69010000 87790000 889 0.0267
BB 12 88010000 98550000 507 -0.3337
BB 12 98570000 115090000 800 0.0586
BB 12 115110000 119350000 197 -0.2031
BB 12 119370000 119430000 4 -20.671
over
chr start end CNA sample.ID
1 68580000 68640000 loss 1-68580000-68640000
3 15360000 16000000 loss 3-15360000-16000000
4 122660000 123500000 gain 4-122660000-123500000
7 48320000 48400000 loss 7-48320000-48400000
12 115860000 115980000 loss 12-115860000-115980000
12 113560000 114920000 gain 12-113560000-114920000
Expected output
ID chrom loc.start loc.end num.mark seg.mean lm(percentage of overlap)
AB 1 3010000 173490000 8430 0.0039 %
AB 1 173510000 173590000 5 -17.738
AB 1 173610000 173830000 12 0.011
AB 1 173850000 173970000 6 -16.121
AB 2 3090000 181990000 8434 0.011
BB 12 3090000 68990000 2950 -0.2022
BB 12 69010000 87790000 889 0.0267
BB 12 88010000 98550000 507 -0.3337
BB 12 98570000 115090000 800 0.0586
BB 12 115110000 119350000 197 -0.2031
BB 12 119370000 119430000 4 -20.671
I tried this script, but it does not work.
for (i in 1:nrow(seg2)) {
seg2$lm <- if((seg2$chrom[i] == over$chr[i]) |
(seg2$loc.start[i] <= over$start[i] & seg2$loc.end[i] >= over$end[i]) |
(over$seg.mean[i] >= 0.459 & seg2$CNA[i] == "gain") |
(over$seg.mean[i] <= -0.678 & seg2$CNA[i] == "loss"),
(over$end[i]-over$start[i])/(seg2$loc.end[i]-seg2$loc.start[i])*100)
}
I am aware of the GenomicRanges package, but any help would be appreciated.
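I strongly recommend using GenomicRanges to do this efficiently. If you already know how to create your own GRanges objects, you need the following two steps to get the length of each overlap: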
library(GenomicRanges)
# find the overlapping pairs (returns a Hits object)
overlapping.index <- findOverlaps(object1, object2)
# get the length of each overlap
width(pintersect(object1[queryHits(overlapping.index)],
                 object2[subjectHits(overlapping.index)]))
Here "object1" and "object2" are GRanges objects holding the coordinates, and "overlapping.index" is the index of the overlapping ranges.
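For readers who have not built GRanges objects before, a minimal sketch of the two steps on a subset of the question's data might look like this. It assumes the Bioconductor package GenomicRanges is installed. The column mapping, the names `hits` and `ov`, and the use of `pintersect()` to extract the overlapping pieces are illustrative choices, not something stated in the answer.

```r
library(GenomicRanges)

# Subset of the question's data: chromosome 12 segments of sample BB
seg2 <- data.frame(
  chrom     = c("12", "12", "12"),
  loc.start = c(98570000, 115110000, 119370000),
  loc.end   = c(115090000, 119350000, 119430000),
  seg.mean  = c(0.0586, -0.2031, -20.671)
)
# The two chromosome 12 regions from the "over" table
over <- data.frame(
  chr   = c("12", "12"),
  start = c(115860000, 113560000),
  end   = c(115980000, 114920000),
  CNA   = c("loss", "gain")
)

# Build GRanges objects; seg.mean and CNA travel along as metadata columns
object1 <- GRanges(seqnames = seg2$chrom,
                   ranges   = IRanges(start = seg2$loc.start, end = seg2$loc.end),
                   seg.mean = seg2$seg.mean)
object2 <- GRanges(seqnames = over$chr,
                   ranges   = IRanges(start = over$start, end = over$end),
                   CNA      = over$CNA)

# Step 1: which segment overlaps which region
hits <- findOverlaps(object1, object2)

# Step 2: intersect each overlapping pair and measure it
ov <- pintersect(object1[queryHits(hits)], object2[subjectHits(hits)])
data.frame(segment = queryHits(hits),
           region  = subjectHits(hits),
           width   = width(ov))
```

Because both objects carry chromosome names, ranges on different chromosomes are never reported as overlapping, which removes the need for the explicit `chrom == chr` test in the loop from the question.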
Once you have the lengths, you can easily get the percentage.
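To make that last step concrete, here is a minimal base R sketch of turning overlap widths into a per-segment percentage. The vectors `query_idx` and `ov_width` are hypothetical stand-ins for `queryHits(overlapping.index)` and the overlap widths, and the segment coordinates are the chromosome 12 rows from the question. It assumes 1-based inclusive coordinates and that the regions hitting one segment do not overlap each other (otherwise the sum would double-count).

```r
# Segment coordinates from the question (chromosome 12, sample BB)
seg_start <- c(98570000, 115110000, 119370000)
seg_end   <- c(115090000, 119350000, 119430000)

# Hypothetical stand-ins for queryHits(...) and the overlap widths
query_idx <- c(1, 2)
ov_width  <- c(1360001, 120001)

# Sum the overlap widths per segment; segments without a hit stay at 0
covered <- numeric(length(seg_start))
sums <- tapply(ov_width, query_idx, sum)
covered[as.integer(names(sums))] <- sums

# Percentage of each segment that is covered (the "lm" column in the question)
seg_width <- seg_end - seg_start + 1
pct <- 100 * covered / seg_width
round(pct, 2)
```

The result is named `pct` rather than `lm` to avoid masking the base function `lm()`. It can be attached to the data frame afterwards, for example with `seg2$lm <- pct`. Any gain/loss threshold on `seg.mean` can be applied by dropping the unwanted hits from `query_idx` and `ov_width` before summing.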