归一化样本之间的深度覆盖

Question

这是一个悬而未决的问题，旨在为基因组中的每个位置（对应于 "CpG" 个位点）定义因样本而异的状态。这个问题的原因是可用的工具定义了 CpG 的状态 windows 而不是单个 CpG。

我有一个 table 这样的：列是：染色体数（chr），兴趣碱基的初始（开始）和最终（结束）位置，预期覆盖率（深度），观察到的不同6种动物的覆盖率（深度1-深度6）。

data <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
data <- read.table(text=data, header=T)

我需要定义一个包含每一行状态的列，状态是：未覆盖区域、交替覆盖和经常覆盖。

为此，首先，我需要对样本之间的深度进行归一化，以获得可以在个体之间进行比较的值。其次，我必须定义状态之间的范围（到目前为止，任何范围都是acceptable）；

我发现这个脚本做的事情类似于我需要的深度标准化，但我还不能应用到我的案例中（这个脚本被设计为 "CpG windows" 并显示 [=34= 的频率] 和 "G" 在每个 window.

setMethod("normalizeCoverage", "methylRawList",

                        function(obj,method){

                          if(method=="median"){
                            x=sapply(obj,function(x) median(x$coverage) )
                          }else if(method=="mean"){
                            x=sapply(obj,function(x) mean(x$coverage) )
                          }else{
                            stop("method option should be either 'mean' or 'median'\n")
                          }
                          sc.fac=max(x)/x #get scaling factor
                          for(i in 1:length(obj)){
                            all.cov=obj[[i]]$coverage
                            fCs    =obj[[i]]$numCs/all.cov
                            fTs    =obj[[i]]$numT/all.cov
                            obj[[i]]$coverage=round(sc.fac[i]*obj[[i]]$coverage)
                            obj[[i]]$numCs   =round(obj[[i]]$coverage*fCs)
                            obj[[i]]$numTs   =round(obj[[i]]$coverage*fTs)
                          }
                          obj

    })

我还检查了 R 的这个 "edge package" 函数，它用于 RNAseq 标准化数据，看起来像这样：

calcNormFactors(object, method=c("TMM","RLE","upperquartile","none"), refColumn = NULL,
      logratioTrim = .3, sumTrim = 0.05, doWeighting=TRUE, Acutoff=-1e10, p=0.75)

但我还不能应用到我的数据。

我希望我的最终结果是这样的：

chr start State
chr1 3273 Often
chr1 3274 alternatively
chr1 3275 no
chr1 3276 often
chr1 3277 no

但我真的只对每个样本覆盖的归一化深度感到满意。

Answer 1

到问题的第一部分（归一化）

"Compute depth-adjusted coverage values, a straight-up application of the cpm function to the count matrix might be sufficient. This will convert your counts into count-per-million values that you can then compare informally between samples"。（Aaron Lun，英国剑桥）

通过 "cpm" 通过 "edgeR" 包从 R

归一化

datamatrix <- data [, c(4:10)]
library (edgeR)

#grouping factor
group <- c(1, 2, 2, 2, 2, 2, 2) #groups of each sample)

#create a DGEList
y <- DGEList(counts=datamatrix,group=group)

#########
$counts
  depth depth1 depth2 depth3 depth4 depth5 depth6
1     7    200     35      1    200    850      0
2     3     50     25      5    300   1500      2
3     8    600     15      8    100    300      5
4     4     30      2     10     59     20      0
5    25     20      7      4    600     45      0

$samples
       group lib.size norm.factors
depth      1       47            1
depth1     2      900            1
depth2     2       84            1
depth3     2       28            1
depth4     2     1259            1
depth5     2     2715            1
depth6     2        7            1
##################
#normalize
y <- calcNormFactors(y)

########
> y
An object of class "DGEList"
$counts
  depth depth1 depth2 depth3 depth4 depth5 depth6
1     7    200     35      1    200    850      0
2     3     50     25      5    300   1500      2
3     8    600     15      8    100    300      5
4     4     30      2     10     59     20      0
5    25     20      7      4    600     45      0

$samples
       group lib.size norm.factors
depth      1       47    0.7423735
depth1     2      900    0.5526927
depth2     2       84    0.9534847
depth3     2       28    0.8652676
depth4     2     1259    0.6588115
depth5     2     2715    1.0358307
depth6     2        7    4.3289213
##########################################

> cpm(y)
      depth     depth1    depth2    depth3    depth4     depth5    depth6
1 200621.61  402071.90 436993.56  41275.42 241125.49 302245.841      0.00
2  85980.69  100517.97 312138.26 206377.10 361688.24 533375.014  66001.27
3 229281.84 1206215.69 187282.96 330203.36 120562.75 106675.003 165003.16
4 114640.92   60310.78  24971.06 412754.21  71132.02   7111.667      0.00
5 716505.76   40207.19  87398.71 165101.68 723376.48  16001.250      0.00

标准化！

即使进行了归一化，我也有 3 个样本的很多值都为零，因为它们的覆盖率很低。我想我必须将它们从分析中删除。我考虑过做一个 PCA 测试，看看这些样本是如何分组的。

我想要一些关于用于归一化的方法的反馈以及我的问题的第二部分

归一化样本之间的深度覆盖

Normalizing Depth Coverage among samples

r

normalization

bioconductor

sapply