Merge some rows into one when the data is continuous
I have a BED file loaded into R as a data frame. The genomic coordinates look like this:
chrom start end
chrX 400 600
chrX 800 1000
chrX 1000 1200
chrX 1200 1400
chrX 1600 1800
chrX 2000 2200
chrX 2200 2400
There is no need to keep all the rows; it would be better to collapse them like this:
chrom start end
chrX 400 600
chrX 800 1400
chrX 1600 1800
chrX 2000 2400
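How can I do this?
I tried to come up with something using dplyr, but without success. group_by does not work because I don't know how to collapse a block of consecutive rows into a single row that takes the start coordinate of the first row and the end coordinate of the last row, and also because there are many such blocks.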
Use the GenomicRanges package from Bioconductor, which is built for BED files and similar data:
library(GenomicRanges)
# Example data
gr <- GRanges(
  seqnames = Rle("chr1", 6),
  ranges = IRanges(start = c(400, 800, 1200, 1400, 1800, 2000),
                   end = c(600, 1000, 1400, 1600, 2000, 2200)))
gr
# GRanges object with 6 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [ 400, 600] *
# [2] chr1 [ 800, 1000] *
# [3] chr1 [1200, 1400] *
# [4] chr1 [1400, 1600] *
# [5] chr1 [1800, 2000] *
# [6] chr1 [2000, 2200] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
# merge contiguous ranges into one using reduce:
reduce(gr)
# GRanges object with 4 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 [ 400, 600] *
# [2] chr1 [ 800, 1000] *
# [3] chr1 [1200, 1600] *
# [4] chr1 [1800, 2200] *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
# EDIT: if the BED file is loaded as a data.frame (df), convert it to a GRanges object first:
gr <- GRanges(seqnames = Rle(df$chrom),
              ranges = IRanges(start = df$start,
                               end = df$end))
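If installing Bioconductor is not an option, the same collapsing can be sketched in base R. This is only an illustration, not part of the answer above: the helper name `merge_touching` is hypothetical, and it assumes that rows should be merged whenever one row's start is less than or equal to the running maximum end on the same chromosome (so touching intervals such as 800-1000 and 1000-1200 are joined). The `block` vector it builds is the kind of grouping key that `group_by` was missing in the question.

```r
# Data from the question
df <- data.frame(
  chrom = "chrX",
  start = c(400, 800, 1000, 1200, 1600, 2000, 2200),
  end   = c(600, 1000, 1200, 1400, 1800, 2200, 2400),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse touching/overlapping rows per chromosome
merge_touching <- function(df) {
  # Sort so that consecutive rows are neighbours on the same chromosome
  df <- df[order(df$chrom, df$start), ]
  # Running maximum of `end` within each chromosome
  run_end <- ave(df$end, df$chrom, FUN = cummax)
  # Running maximum seen *before* each row (-Inf for a chromosome's first row)
  prev_end <- ave(run_end, df$chrom,
                  FUN = function(x) c(-Inf, head(x, -1)))
  # A new block begins whenever a row starts past everything seen so far
  block <- cumsum(df$start > prev_end)
  # One output row per block: first start, last end
  data.frame(
    chrom = as.vector(tapply(df$chrom, block, function(x) x[1])),
    start = as.vector(tapply(df$start, block, min)),
    end   = as.vector(tapply(df$end, block, max)),
    stringsAsFactors = FALSE
  )
}

merged <- merge_touching(df)
merged
```

For the data in the question this should yield the four rows asked for (400-600, 800-1400, 1600-1800, 2000-2400). For real BED files, which may carry strand information or span many chromosomes, `reduce()` from GenomicRanges is likely the more robust choice.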