Plotting coverage depth in 1kb windows?

Question

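I want to plot average coverage depth across my whole genome, with the chromosomes lined up in ascending order. I have calculated the coverage depth per position for my genome using samtools. I would like to generate a plot (using 1kb windows) like Figure 7 here: http://www.g3journal.org/content/ggg/6/8/2421/F7.large.jpg?width=800&height=600&carousel=1

Example data frame: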

Chr   locus depth
chr1    1   20  
chr1    2   24  
chr1    3   26  
chr2    1   53  
chr2    2   71  
chr2    3   74  
chr3    1   29  
chr3    2   36  
chr3    3   39

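Do I need to change the format of the data frame so that the V2 variable is numbered continuously? Is there a way to average every 1000 rows and plot the 1kb windows? And how would I go about the plotting?

Update/edit: using this post, I was able to create a new dataset as the rolling mean over non-overlapping 1kb windows, and I did make V2 continuous, i.e. 1:9 instead of 1,2,3,1,2,3,1,2,3.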

library(reshape) # to rename columns
library(data.table) # to make sliding window dataframe
library(zoo) # to apply rolling function for sliding window

# Genome coverage in non-overlapping 1kb windows, computed per chromosome
# (Xdepth has the columns Chr, locus and coverageX; see my answer below)
Xdepth.average <- setDT(Xdepth)[, .(
  window.start = rollapply(locus, width = 1000, by = 1000, FUN = min, align = "left", partial = TRUE),
  window.end = rollapply(locus, width = 1000, by = 1000, FUN = max, align = "left", partial = TRUE),
  coverage = rollapply(coverageX, width = 1000, by = 1000, FUN = mean, align = "left", partial = TRUE)
), .(Chr)]

And plotted it:

library(ggplot2)
Xdepth.average.plot <- ggplot(Xdepth.average, aes(x=window.end, y=coverage, colour=Chr)) + 
  geom_point(shape = 20, size = 1) +
  scale_x_continuous(name="Genomic Position (bp)", limits=c(0, 12071326), labels = scales::scientific) +
  scale_y_continuous(name="Average Coverage Depth", limits=c(0, 200))

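I had no luck with facet_grid, so I added reference lines with geom_vline(xintercept = c(...)) instead. See the answer I posted below for additional details, code, and links to the plots. Now I just need to sort out the labels...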

Answer 1

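To address the plotting part of the question, have you tried adding + facet_grid(~ Chr) to your plot? (Or + facet_grid(~ V2), depending on your variable names.)

If I use your example data, I don't get your error message. That message often shows up when you try to take log(0), so you might want to add a pseudocount, log(x + 1), or use a sqrt or asinh transformation (the latter if you have negative values). On the topic of example data, it helps to post it in a format that other users can copy and paste to test your problem, for example: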

depth <- data.frame(
  Chr = paste0("chr", c(1, 1, 1, 2, 2, 2, 3, 3, 3)),
  locus = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  depth = c(20, 24, 26, 53, 71, 74, 29, 36, 39)
)
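Coming back to the log(0) remark above, here is a minimal sketch of how the suggested transformations behave on a depth vector that contains a zero. The values are made up for illustration and are not from the asker's data.

```r
# Hypothetical depths, including one uncovered position
depth_values <- c(0, 1, 20, 74)

raw_log    <- log(depth_values)    # log(0) is -Inf, which cannot be placed on a finite axis
pseudo_log <- log1p(depth_values)  # log(x + 1): finite everywhere for x >= 0
root       <- sqrt(depth_values)   # milder compression, also defined at 0
inv_sinh   <- asinh(c(-5, 0, 5))   # defined for negative values as well

print(raw_log)
print(pseudo_log)
```

In ggplot2 the same idea can be applied by transforming the y aesthetic directly, e.g. aes(y = log1p(depth)).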

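To address the bioinformatics part, you might want to have a look at the GenomicRanges Bioconductor package: it has a tileGenome() function for making bins, and you can use findOverlaps() between your data and the bins. Once you have those overlaps, you can split() your data by the bin it overlaps and calculate the mean coverage of each split.

Note that you may need to spend some time getting familiar with the GRanges object structure and getting your data into that (or the GPos) format. GRanges objects are similar to BED files with genomic intervals, whereas GPos objects are more like exact, single-nucleotide coordinates.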
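The tileGenome() / findOverlaps() / split() route described above could look roughly like the sketch below. It runs on made-up toy data, and it assumes that the chromosome length can be taken as the largest observed position; with real data you would pass the true chromosome lengths instead.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Toy per-base depth table (hypothetical values): 2500 positions on chr1, 1200 on chr2
depth <- data.frame(
  Chr   = rep(c("chr1", "chr2"), times = c(2500, 1200)),
  locus = c(1:2500, 1:1200),
  depth = rep(c(20, 50), times = c(2500, 1200))
)

# Assumption: chromosome length = largest observed position
chr_len <- vapply(split(depth$locus, depth$Chr), max, numeric(1))

# Non-overlapping 1 kb bins; the last bin of each chromosome is cut short
bins <- tileGenome(chr_len, tilewidth = 1000, cut.last.tile.in.chrom = TRUE)

# One width-1 range per position
pos <- GRanges(depth$Chr, IRanges(start = depth$locus, width = 1))

# Match positions to bins, split the depths by bin, and average each split
hits   <- findOverlaps(pos, bins)
by_bin <- split(depth$depth[queryHits(hits)], subjectHits(hits))

coverage <- rep(NA_real_, length(bins))  # bins without data stay NA
coverage[as.integer(names(by_bin))] <- unname(vapply(by_bin, mean, numeric(1)))
bins$coverage <- coverage

# Plain data frame (seqnames, start, end, width, strand, coverage) for ggplot2
window_df <- as.data.frame(bins)
```

Because every position is a width-1 range, each one falls into exactly one bin, so the per-bin mean is a plain average of the depths in that window.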

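However, are you sure you don't want read counts per bin rather than average coverage? It is good to keep in mind that coverage is somewhat biased towards long reads.

As a non-R solution, you could also use bamCoverage from the deepTools suite with a bin size of 1000 bp.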

Edit: reproducible example of the plotting

library(ggplot2, verbose = FALSE, quietly = TRUE)
suppressPackageStartupMessages(library(GenomicRanges))

# Setting up some dummy data
seqinfo <- rtracklayer::SeqinfoForUCSCGenome("hg19")
seqinfo <- keepStandardChromosomes(seqinfo)
granges <- tileGenome(seqinfo, tilewidth = 1e6, cut.last.tile.in.chrom = TRUE)
granges$y <- rnorm(length(granges))

# Convert to dataframe
df <- as.data.frame(granges)

# The plotting
ggplot(df, aes(x = (start + end)/2, y = y)) +
  geom_point() +
  facet_grid(~ seqnames, scales = "free_x", space = "free_x") +
  scale_x_continuous(expand = c(0,0)) +
  theme(aspect.ratio = NULL,
        panel.spacing = unit(0, "mm"))

Created on 2019-04-22 by the reprex package (v0.2.1)

Answer 2

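After playing around with this some more, I was able to create a new dataset as the rolling mean over non-overlapping 1kb windows using this post: [...] very long or memory-intensive.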

library(reshape) # to rename columns
library(data.table) # to make sliding window dataframe
library(zoo) # to apply rolling function for sliding window
library(ggplot2)

# Load data into a data frame, rename headers, make locus continuous, create subsets
depth <- read.table("sorted.depth", sep = "\t", header = FALSE)
depth <- rename(depth, c(V1 = "Chr", V2 = "locus", V3 = "coverageX", V4 = "coverageY"))
depth$locus <- seq_len(nrow(depth))  # genome-wide running position (12157105 rows here)
Xdepth <- subset(depth, select = c("Chr", "locus", "coverageX"))

# Genome coverage in non-overlapping 1kb windows, computed per chromosome
Xdepth.average <- setDT(Xdepth)[, .(
  window.start = rollapply(locus, width = 1000, by = 1000, FUN = min, align = "left", partial = TRUE),
  window.end = rollapply(locus, width = 1000, by = 1000, FUN = max, align = "left", partial = TRUE),
  coverage = rollapply(coverageX, width = 1000, by = 1000, FUN = mean, align = "left", partial = TRUE)
), .(Chr)]
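For readers who want to see what the windowing does without zoo or data.table, here is a base-R sketch on made-up toy data. It assigns each row a window id by integer-dividing its row index within the chromosome by the window size, which should give the same non-overlapping windows (with a shorter last window per chromosome). It is only an illustration; for a full genome the data.table version above will be much faster.

```r
# Toy data with a genome-wide continuous locus (hypothetical values)
Xdepth <- data.frame(
  Chr       = rep(c("chr1", "chr2"), times = c(2500, 1200)),
  locus     = 1:3700,
  coverageX = rep(c(20, 50), times = c(2500, 1200))
)
window_size <- 1000

# Row index within each chromosome, integer-divided into window ids 0, 1, 2, ...
idx_in_chr    <- ave(seq_len(nrow(Xdepth)), Xdepth$Chr, FUN = seq_along)
Xdepth$window <- (idx_in_chr - 1) %/% window_size

# Summarise every (chromosome, window) group that actually contains rows
groups <- split(Xdepth, list(Xdepth$Chr, Xdepth$window), drop = TRUE)
Xdepth.average <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    Chr          = g$Chr[1],
    window.start = min(g$locus),
    window.end   = max(g$locus),
    coverage     = mean(g$coverageX)
  )
}))

# Put the windows back into genome order
Xdepth.average <- Xdepth.average[order(Xdepth.average$window.start), ]
rownames(Xdepth.average) <- NULL
print(Xdepth.average)
```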

Plotting the new dataset:

#plot sliding window by end position and coverage
Xdepth.average.plot <- ggplot(Xdepth.average, aes(x=window.end, y=coverage, colour=Chr)) + 
  geom_point(shape = 20, size = 1) +
  scale_x_continuous(name="Genomic Position (bp)", limits=c(0, 12071326), labels = scales::scientific) +
  scale_y_continuous(name="Average Coverage Depth", limits=c(0, 250))

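I then tried adding facet_grid(. ~ Chr) to split by chromosome, but the panels end up spaced far apart and each one repeats the entire axis instead of the axis running continuously.

Update: I tried various tweaks of scales = "free_x" and space = "free_x". The closest I got was removing the limits from scale_x_continuous() and using scales = "free_x" and space = "free_x" with facet_grid, but the panel widths are still not proportional to chromosome size and the x-axis is very erratic. For comparison, I manually added reference lines between the chromosomes with geom_vline(xintercept = c(...)) (the expected result).

The ideal separation and x-axis, but without panel labels, using geom_vline: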

Xdepth.average.plot +
  geom_vline(xintercept = c(230218, 1043402, 1360022, 2891955, 3468829, 3738990, 4829930, 5392573, 5832461, 6578212, 7245028, 8323205, 9247636, 10031969, 11123260, 12071326, 12157105))

Plot with Reference lines
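The reference-line positions above are typed in by hand. As a sketch under the assumption that the x coordinate runs continuously across the genome (as it does in Xdepth.average), the boundaries can instead be derived from the data, and the midpoints between them can serve as positions for chromosome labels. The toy values and the names chr_end, chr_start and chr_mid are made up for illustration.

```r
library(ggplot2)

# Toy windowed data with a genome-wide continuous x coordinate (hypothetical values)
Xdepth.average <- data.frame(
  Chr        = rep(c("chr1", "chr2", "chr3"), times = c(3, 2, 4)),
  window.end = c(1000, 2000, 2500, 3500, 3700, 4700, 5700, 6700, 7000),
  coverage   = c(20, 24, 26, 53, 71, 29, 36, 39, 33)
)

# Chromosome boundary = last window end on that chromosome, sorted into genome order
chr_end   <- sort(vapply(split(Xdepth.average$window.end, Xdepth.average$Chr), max, numeric(1)))
chr_start <- c(0, unname(head(chr_end, -1)))
chr_mid   <- (chr_start + chr_end) / 2  # label position in the middle of each chromosome

p <- ggplot(Xdepth.average, aes(x = window.end, y = coverage, colour = Chr)) +
  geom_point(shape = 20, size = 1) +
  geom_vline(xintercept = chr_end, colour = "grey60") +
  scale_x_continuous(name = "Chromosome", breaks = chr_mid, labels = names(chr_mid)) +
  scale_y_continuous(name = "Average Coverage Depth") +
  theme(legend.position = "none")
```

Sorting the boundaries by value rather than by name keeps chromosomes in genome order even when names like chr10 would sort before chr2 alphabetically.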

Removing the limits from scale_x_continuous() and using facet_grid:

Xdepth.average.plot5 <- ggplot(Xdepth.average, aes(x=window.end, y=coverage, colour=Chr)) + 
  geom_point(shape = 20, size = 1) +
  scale_x_continuous(name="Genomic Position (bp)", labels = scales::scientific, breaks = 
                       c(0, 2000000, 4000000, 6000000, 8000000, 10000000, 12000000)) +
  scale_y_continuous(name="Average Coverage Depth", limits=c(0, 200), breaks = c(0, 50, 100, 150, 200, 300, 400, 500)) +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  theme(legend.position="none")
X.p5 <- Xdepth.average.plot5 + facet_grid(. ~ Chr, labeller=chr_labeller, space="free_x", scales = "free_x")+
  theme(panel.spacing.x = grid::unit(0, "cm"))
X.p5

Plot with Facets and no limit on X-axis
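The facet code above passes labeller = chr_labeller, which is never defined in this post. Purely as an assumption about what it might have done, the sketch below builds a stand-in with ggplot2's as_labeller() that strips the "chr" prefix from the strip labels. It also sets the factor levels of Chr explicitly so that the panels appear in ascending chromosome order. The toy data are made up.

```r
library(ggplot2)

# Hypothetical stand-in for the undefined chr_labeller: show "1" instead of "chr1"
chr_labeller <- as_labeller(function(x) sub("^chr", "", x))

# Toy windowed data (hypothetical values); explicit levels keep chr10 after chr2
chr_levels <- c("chr1", "chr2", "chr10")
toy <- data.frame(
  Chr        = factor(rep(chr_levels, each = 3), levels = chr_levels),
  window.end = rep(c(1000, 2000, 3000), times = 3),
  coverage   = c(20, 24, 26, 53, 71, 74, 29, 36, 39)
)

p <- ggplot(toy, aes(x = window.end, y = coverage, colour = Chr)) +
  geom_point(shape = 20, size = 1) +
  facet_grid(. ~ Chr, labeller = chr_labeller, scales = "free_x", space = "free_x") +
  theme(panel.spacing.x = grid::unit(0, "cm"), legend.position = "none")

built <- ggplot_build(p)  # forces the facets and labels to be computed
```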


r

bioinformatics

ggplot2

genome