Merging adjacent bins with the same property in a GenomicRanges object
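After dividing the genome into adjacent, non-overlapping bins, e.g. with tileGenome, I computed some property for each bin (say, 1 or 2).
Now I want to merge adjacent bins that share the same property.
A minimal example looks like this: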
library(GenomicRanges)
chrSizes <- c(chr1 = 1000, chr2 = 500)
bins <- tileGenome(chrSizes, tilewidth = 200, cut.last.tile.in.chrom = TRUE)
bins$property <- rep(1:2, each = 4)
bins
GRanges object with 8 ranges and 1 metadata column:
seqnames ranges strand | property
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 1-200 * | 1
[2] chr1 201-400 * | 1
[3] chr1 401-600 * | 1
[4] chr1 601-800 * | 1
[5] chr1 801-1000 * | 2
[6] chr2 1-200 * | 2
[7] chr2 201-400 * | 2
[8] chr2 401-500 * | 2
-------
seqinfo: 2 sequences from an unspecified genome
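The first 4 bins have property 1 and should therefore be merged into a single bin.
I looked through the GRanges documentation but could not find an obvious native solution.
Note that seqname boundaries must be respected (e.g. chr1 and chr2 stay separate, regardless of the property).
Obviously I could use a loop, but I would rather use a native GRanges solution, e.g. something based on union that I may have overlooked.
The desired output should look like this: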
seqnames ranges strand | property
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 1-800 * | 1
[2] chr1 801-1000 * | 2
[3] chr2 1-500 * | 2
R GenomicRanges:
result <- unlist(reduce(split(bins, ~property)))
result$property <- names(result)
# GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | property
# <Rle> <IRanges> <Rle> | <character>
# 1 chr1 1-800 * | 1
# 2 chr1 801-1000 * | 2
# 2 chr2 1-500 * | 2
# -------
# seqinfo: 2 sequences from an unspecified genome
Python PyRanges:
import pandas as pd
from io import StringIO
import pyranges as pr
c = """Chromosome Start End Value
chr1 1 200 Python
chr1 201 400 Python
chr1 401 600 Python
chr1 601 800 Python
chr1 801 1000 R
chr2 1 200 R
chr2 201 400 R
chr2 401 500 R"""
df = pd.read_table(StringIO(c), sep=r"\s+")  # split on any run of whitespace
gr = pr.PyRanges(df)
gr.merge(by="Value", slack=1)
# +--------------+-----------+-----------+------------+
# | Chromosome | Start | End | Value |
# | (category) | (int32) | (int32) | (object) |
# |--------------+-----------+-----------+------------|
# | chr1 | 1 | 800 | Python |
# | chr1 | 801 | 1000 | R |
# | chr2 | 1 | 500 | R |
# +--------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 4 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
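For anyone who wants to see the merging rule itself spelled out, here is a minimal pure-Python sketch. It is not part of GenomicRanges or PyRanges; the function name merge_adjacent_bins and the tuple layout (chrom, start, end, prop) are my own assumptions for illustration. It assumes 1-based inclusive coordinates and that the bins are already sorted by chromosome and start, as tileGenome output is.

```python
def merge_adjacent_bins(bins):
    """Merge runs of adjacent bins that share chromosome and property.

    `bins` is a list of (chrom, start, end, prop) tuples (hypothetical
    layout), 1-based inclusive, sorted by chromosome and start.
    """
    merged = []
    for chrom, start, end, prop in bins:
        if merged:
            last_chrom, last_start, last_end, last_prop = merged[-1]
            # Extend the previous range only if it is on the same chromosome,
            # has the same property, and ends right before this bin starts.
            if (chrom, prop) == (last_chrom, last_prop) and start == last_end + 1:
                merged[-1] = (last_chrom, last_start, end, last_prop)
                continue
        # Otherwise start a new range.
        merged.append((chrom, start, end, prop))
    return merged


# Same bins as in the question: property 1 for the first four, 2 for the rest.
bins = [
    ("chr1", 1, 200, 1),
    ("chr1", 201, 400, 1),
    ("chr1", 401, 600, 1),
    ("chr1", 601, 800, 1),
    ("chr1", 801, 1000, 2),
    ("chr2", 1, 200, 2),
    ("chr2", 201, 400, 2),
    ("chr2", 401, 500, 2),
]

result = merge_adjacent_bins(bins)
for row in result:
    print(*row)
```

Because the chromosome is part of the comparison, the property-2 bin at the end of chr1 is never joined with the property-2 bins on chr2, which is the seqname-boundary requirement from the question. For real data the library solutions above are probably the better choice; this is only meant to make the logic explicit.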