差异表达式分析-开关截距系数

Question

我正在尝试使用 edgeR 对生物计数数据集进行差异表达分析。我的样本分为病例和对照，我想知道病例样本（即有条件的样本）与对照中上调或下调的基因。但是，我遇到一个问题，即当前基因的结果与对照样本相关，而不是使用 edgeR 时的情况。我可以用假数据在 R 中重现这个问题。

假数据在对照中的计数值低于案例样本，因此我们预计所有基因在案例样本中都会上调：

#First create the expression matrix
set.seed(101)
#create data so first 50 (the controls) have lower values than second 50 samples (those with the condition)
exprDat <- cbind(matrix(round((runif(500)/10)*100),ncol=50),
                    matrix(round((1-runif(500)/10)*100),ncol=50))
colnames(exprDat) <- paste0("sample_",1:100)
rownames(exprDat) <- paste0("gene_",1:10)

#Now create the annotation dataset
targets <- data.frame("group_sample"=colnames(exprDat),
                      "case_control"=as.factor(c(rep("Control",50),
                                                        rep("Case",50))))
#create the design matrix comparing case and control
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
#normalise
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
#build linear model
fit <- edgeR::glmFit(y, design = design)
#test the comparison, coef=1 is the intercept
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table

我对上面的问题是 logFC 都与对照样本有关，而不是与案例有关。我们首先可以在设计中看到这一点，因为该列是 case_controlControl:

> head(design)
  (Intercept) case_controlControl
1           1                   1
2           1                   1
3           1                   1
4           1                   1
5           1                   1
6           1                   1

然后 logFC 表明这些基因在对照样本中与病例相比下调：

> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511

起初我认为这不是问题，因为我可以改变 targets 中的因子排序，这样设计矩阵就会创建一个 case_controlCase，这将是相反的比较，这意味着p 值将相同，但 logFC 的方向将被翻转：

#reorder levels in target
levels(targets$case_control) <- sort(levels(targets$case_control),
                                        decreasing=TRUE)
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
fit <- edgeR::glmFit(y, design = design)
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table

设计矩阵按预期更新：

> head(design)
  (Intercept) case_controlCase
1           1                1
2           1                1
3           1                1
4           1                1
5           1                1
6           1                1

然而，奇怪的是，这些基因仍然与控制相关：

> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511

我不知道为什么在 design 发生变化后仍然发生这种情况！如果有人有任何线索那将是惊人的，因为这已经让我头疼了一段时间！或者，如果有人有不同的方法来翻转 logFC，因此它与案例样本而不是控件相关（即确保控件样本被视为 GLM 中的截距），那就太好了。请注意，我知道我可以只交换结果 table 中的符号，但这是我真正想避免的事情，并且更愿意了解我上面的代码出了什么问题。

最后，声明一下，我认为我的问题不是 edgeR 特有的，而是使用 GLM 进行差异分析的一般问题。从根本上说，我只想知道如何使用 GLM 和设计矩阵交换截距系数。为清楚起见，我还将此发布到特定生物分析社区网站 Biostars：https://www.biostars.org/p/9469339/

会话信息：

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3               reshape2_1.4.4              data.table_1.14.0           Hmisc_4.5-0                
 [5] Formula_1.2-4               survival_3.2-9              lattice_0.20-38             ggrepel_0.9.1              
 [9] viridis_0.6.0               viridisLite_0.4.0           cowplot_1.1.1               ggplot2_3.3.3              
[13] qs_0.24.1                   edgeR_3.28.1                limma_3.42.2                purrr_0.3.4                
[17] magrittr_2.0.1              dplyr_1.0.6                 SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
[21] DelayedArray_0.12.3         BiocParallel_1.20.1         matrixStats_0.58.0          Biobase_2.46.0             
[25] biomaRt_2.42.1              BSgenome_1.54.0             rtracklayer_1.46.0          Biostrings_2.54.0          
[29] XVector_0.26.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1         IRanges_2.20.2             
[33] S4Vectors_0.24.4            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] colorspace_2.0-1         ellipsis_0.3.2           htmlTable_2.1.0          base64enc_0.1-3         
 [5] rstudioapi_0.13          listenv_0.8.0            bit64_4.0.5              AnnotationDbi_1.48.0    
 [9] fansi_0.4.2              codetools_0.2-16         splines_3.6.0            cachem_1.0.4            
[13] knitr_1.33               Rsamtools_2.2.3          cluster_2.0.8            dbplyr_2.1.1            
[17] png_0.1-7                sctransform_0.3.2        BiocManager_1.30.12      compiler_3.6.0          
[21] httr_1.4.2               backports_1.2.1          assertthat_0.2.1         Matrix_1.2-17           
[25] fastmap_1.1.0            cli_2.5.0                htmltools_0.5.1.1        prettyunits_1.1.1       
[29] tools_3.6.0              gtable_0.3.0             glue_1.4.2               GenomeInfoDbData_1.2.2  
[33] rappdirs_0.3.3           Rcpp_1.0.6               vctrs_0.3.7              xfun_0.22               
[37] stringr_1.4.0            globals_0.14.0           lifecycle_1.0.0          pacman_0.5.1            
[41] XML_3.99-0.3             future_1.21.0            zlibbioc_1.32.0          MASS_7.3-51.4           
[45] scales_1.1.1             hms_1.0.0                RColorBrewer_1.1-2       yaml_2.2.1              
[49] curl_4.3.1               memoise_2.0.0            rpart_4.1-15             latticeExtra_0.6-29     
[53] stringi_1.5.3            RSQLite_2.2.4            checkmate_2.0.0          rlang_0.4.11            
[57] pkgconfig_2.0.3          bitops_1.0-7             evaluate_0.14            GenomicAlignments_1.22.1
[61] htmlwidgets_1.5.3        bit_4.0.4                tidyselect_1.1.1         parallelly_1.25.0       
[65] plyr_1.8.6               R6_2.5.0                 generics_0.1.0           DBI_1.1.1               
[69] pillar_1.6.0             foreign_0.8-71           withr_2.4.2              RCurl_1.98-1.3          
[73] nnet_7.3-12              tibble_3.1.1             future.apply_1.7.0       crayon_1.4.1            
[77] utf8_1.2.1               BiocFileCache_1.10.2     RApiSerialize_0.1.0      rmarkdown_2.7           
[81] jpeg_0.1-8.1             progress_1.2.2           locfit_1.5-9.4           grid_3.6.0              
[85] blob_1.2.1               infotheo_1.2.0           digest_0.6.27            openssl_1.4.4           
[89] RcppParallel_5.0.3       munsell_0.5.0            stringfish_0.15.0        askpass_1.1

Answer 1

您正在重命名因子水平，而不是重新调整因子水平。要解决此问题，请尝试：

targets$case_control <- relevel(targets$case_control, "Control")

而不是

levels(targets$case_control) <- sort(levels(targets$case_control), decreasing=TRUE)

（并观察每种情况下对 targets$case_control 的影响）。

差异表达式分析-开关截距系数

Differential expression analysis - switch intercept coefficient

r

bioinformatics

matrix

glm

rna-seq