差异表达式分析-开关截距系数
Differential expression analysis - switch intercept coefficient
我正在尝试使用 edgeR 对生物计数数据集进行差异表达分析。我的样本分为病例和对照,我想知道病例样本(即有条件的样本)与对照中上调或下调的基因。但是,我遇到一个问题,即当前基因的结果与对照样本相关,而不是使用 edgeR
时的情况。我可以用假数据在 R 中重现这个问题。
假数据在对照中的计数值低于案例样本,因此我们预计所有基因在案例样本中都会上调:
#First create the expression matrix
set.seed(101)
#create data so first 50 (the controls) have lower values than second 50 samples (those with the condition)
exprDat <- cbind(matrix(round((runif(500)/10)*100),ncol=50),
matrix(round((1-runif(500)/10)*100),ncol=50))
colnames(exprDat) <- paste0("sample_",1:100)
rownames(exprDat) <- paste0("gene_",1:10)
#Now create the annotation dataset
targets <- data.frame("group_sample"=colnames(exprDat),
"case_control"=as.factor(c(rep("Control",50),
rep("Case",50))))
#create the design matrix comparing case and control
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
group = targets[["case_control"]])
#normalise
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
#build linear model
fit <- edgeR::glmFit(y, design = design)
#test the comparison, coef=1 is the intercept
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
我对上面的问题是 logFC 都与对照样本有关,而不是与案例有关。我们首先可以在设计中看到这一点,因为该列是 case_controlControl
:
> head(design)
(Intercept) case_controlControl
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
然后 logFC 表明这些基因在对照样本中与病例相比下调:
> pvals
logFC logCPM LR PValue
gene_1 -0.14418015 16.69933 2.4281485 0.119173587
gene_2 -0.03421562 16.69108 0.1422319 0.706072179
gene_3 -0.12961726 16.69159 1.9632930 0.161161580
gene_4 -0.17710527 16.68963 3.5894597 0.058147147
gene_5 -0.14551401 16.69491 2.4641372 0.116471640
gene_6 0.17585301 16.70497 4.1366713 0.041963611
gene_7 -0.05396444 16.69328 0.3514909 0.553270396
gene_8 -0.15662395 16.69380 2.8394354 0.091976525
gene_9 -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
起初我认为这不是问题,因为我可以改变 targets
中的因子排序,这样设计矩阵就会创建一个 case_controlCase
,这将是相反的比较,这意味着p 值将相同,但 logFC 的方向将被翻转:
#reorder levels in target
levels(targets$case_control) <- sort(levels(targets$case_control),
decreasing=TRUE)
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
group = targets[["case_control"]])
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
fit <- edgeR::glmFit(y, design = design)
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
设计矩阵按预期更新:
> head(design)
(Intercept) case_controlCase
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
然而,奇怪的是,这些基因仍然与控制相关:
> pvals
logFC logCPM LR PValue
gene_1 -0.14418015 16.69933 2.4281485 0.119173587
gene_2 -0.03421562 16.69108 0.1422319 0.706072179
gene_3 -0.12961726 16.69159 1.9632930 0.161161580
gene_4 -0.17710527 16.68963 3.5894597 0.058147147
gene_5 -0.14551401 16.69491 2.4641372 0.116471640
gene_6 0.17585301 16.70497 4.1366713 0.041963611
gene_7 -0.05396444 16.69328 0.3514909 0.553270396
gene_8 -0.15662395 16.69380 2.8394354 0.091976525
gene_9 -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
我不知道为什么在 design
发生变化后仍然发生这种情况!如果有人有任何线索那将是惊人的,因为这已经让我头疼了一段时间!或者,如果有人有不同的方法来翻转 logFC,因此它与案例样本而不是控件相关(即确保控件样本被视为 GLM 中的截距),那就太好了。请注意,我知道我可以只交换结果 table 中的符号,但这是我真正想避免的事情,并且更愿意了解我上面的代码出了什么问题。
最后,声明一下,我认为我的问题不是 edgeR
特有的,而是使用 GLM 进行差异分析的一般问题。从根本上说,我只想知道如何使用 GLM 和设计矩阵交换截距系数。为清楚起见,我还将此发布到特定生物分析社区网站 Biostars:https://www.biostars.org/p/9469339/
会话信息:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] gridExtra_2.3 reshape2_1.4.4 data.table_1.14.0 Hmisc_4.5-0
[5] Formula_1.2-4 survival_3.2-9 lattice_0.20-38 ggrepel_0.9.1
[9] viridis_0.6.0 viridisLite_0.4.0 cowplot_1.1.1 ggplot2_3.3.3
[13] qs_0.24.1 edgeR_3.28.1 limma_3.42.2 purrr_0.3.4
[17] magrittr_2.0.1 dplyr_1.0.6 SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1
[21] DelayedArray_0.12.3 BiocParallel_1.20.1 matrixStats_0.58.0 Biobase_2.46.0
[25] biomaRt_2.42.1 BSgenome_1.54.0 rtracklayer_1.46.0 Biostrings_2.54.0
[29] XVector_0.26.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2
[33] S4Vectors_0.24.4 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-1 ellipsis_0.3.2 htmlTable_2.1.0 base64enc_0.1-3
[5] rstudioapi_0.13 listenv_0.8.0 bit64_4.0.5 AnnotationDbi_1.48.0
[9] fansi_0.4.2 codetools_0.2-16 splines_3.6.0 cachem_1.0.4
[13] knitr_1.33 Rsamtools_2.2.3 cluster_2.0.8 dbplyr_2.1.1
[17] png_0.1-7 sctransform_0.3.2 BiocManager_1.30.12 compiler_3.6.0
[21] httr_1.4.2 backports_1.2.1 assertthat_0.2.1 Matrix_1.2-17
[25] fastmap_1.1.0 cli_2.5.0 htmltools_0.5.1.1 prettyunits_1.1.1
[29] tools_3.6.0 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.2
[33] rappdirs_0.3.3 Rcpp_1.0.6 vctrs_0.3.7 xfun_0.22
[37] stringr_1.4.0 globals_0.14.0 lifecycle_1.0.0 pacman_0.5.1
[41] XML_3.99-0.3 future_1.21.0 zlibbioc_1.32.0 MASS_7.3-51.4
[45] scales_1.1.1 hms_1.0.0 RColorBrewer_1.1-2 yaml_2.2.1
[49] curl_4.3.1 memoise_2.0.0 rpart_4.1-15 latticeExtra_0.6-29
[53] stringi_1.5.3 RSQLite_2.2.4 checkmate_2.0.0 rlang_0.4.11
[57] pkgconfig_2.0.3 bitops_1.0-7 evaluate_0.14 GenomicAlignments_1.22.1
[61] htmlwidgets_1.5.3 bit_4.0.4 tidyselect_1.1.1 parallelly_1.25.0
[65] plyr_1.8.6 R6_2.5.0 generics_0.1.0 DBI_1.1.1
[69] pillar_1.6.0 foreign_0.8-71 withr_2.4.2 RCurl_1.98-1.3
[73] nnet_7.3-12 tibble_3.1.1 future.apply_1.7.0 crayon_1.4.1
[77] utf8_1.2.1 BiocFileCache_1.10.2 RApiSerialize_0.1.0 rmarkdown_2.7
[81] jpeg_0.1-8.1 progress_1.2.2 locfit_1.5-9.4 grid_3.6.0
[85] blob_1.2.1 infotheo_1.2.0 digest_0.6.27 openssl_1.4.4
[89] RcppParallel_5.0.3 munsell_0.5.0 stringfish_0.15.0 askpass_1.1
您正在重命名因子水平,而不是重新调整因子水平。要解决此问题,请尝试:
targets$case_control <- relevel(targets$case_control, "Control")
而不是
levels(targets$case_control) <- sort(levels(targets$case_control), decreasing=TRUE)
(并观察每种情况下对 targets$case_control
的影响)。
我正在尝试使用 edgeR 对生物计数数据集进行差异表达分析。我的样本分为病例和对照,我想知道病例样本(即有条件的样本)与对照中上调或下调的基因。但是,我遇到一个问题,即当前基因的结果与对照样本相关,而不是使用 edgeR
时的情况。我可以用假数据在 R 中重现这个问题。
假数据在对照中的计数值低于案例样本,因此我们预计所有基因在案例样本中都会上调:
#First create the expression matrix
set.seed(101)
#create data so first 50 (the controls) have lower values than second 50 samples (those with the condition)
exprDat <- cbind(matrix(round((runif(500)/10)*100),ncol=50),
matrix(round((1-runif(500)/10)*100),ncol=50))
colnames(exprDat) <- paste0("sample_",1:100)
rownames(exprDat) <- paste0("gene_",1:10)
#Now create the annotation dataset
targets <- data.frame("group_sample"=colnames(exprDat),
"case_control"=as.factor(c(rep("Control",50),
rep("Case",50))))
#create the design matrix comparing case and control
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
group = targets[["case_control"]])
#normalise
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
#build linear model
fit <- edgeR::glmFit(y, design = design)
#test the comparison, coef=1 is the intercept
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
我对上面的问题是 logFC 都与对照样本有关,而不是与案例有关。我们首先可以在设计中看到这一点,因为该列是 case_controlControl
:
> head(design)
(Intercept) case_controlControl
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
然后 logFC 表明这些基因在对照样本中与病例相比下调:
> pvals
logFC logCPM LR PValue
gene_1 -0.14418015 16.69933 2.4281485 0.119173587
gene_2 -0.03421562 16.69108 0.1422319 0.706072179
gene_3 -0.12961726 16.69159 1.9632930 0.161161580
gene_4 -0.17710527 16.68963 3.5894597 0.058147147
gene_5 -0.14551401 16.69491 2.4641372 0.116471640
gene_6 0.17585301 16.70497 4.1366713 0.041963611
gene_7 -0.05396444 16.69328 0.3514909 0.553270396
gene_8 -0.15662395 16.69380 2.8394354 0.091976525
gene_9 -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
起初我认为这不是问题,因为我可以改变 targets
中的因子排序,这样设计矩阵就会创建一个 case_controlCase
,这将是相反的比较,这意味着p 值将相同,但 logFC 的方向将被翻转:
#reorder levels in target
levels(targets$case_control) <- sort(levels(targets$case_control),
decreasing=TRUE)
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
group = targets[["case_control"]])
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
fit <- edgeR::glmFit(y, design = design)
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table
设计矩阵按预期更新:
> head(design)
(Intercept) case_controlCase
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
然而,奇怪的是,这些基因仍然与控制相关:
> pvals
logFC logCPM LR PValue
gene_1 -0.14418015 16.69933 2.4281485 0.119173587
gene_2 -0.03421562 16.69108 0.1422319 0.706072179
gene_3 -0.12961726 16.69159 1.9632930 0.161161580
gene_4 -0.17710527 16.68963 3.5894597 0.058147147
gene_5 -0.14551401 16.69491 2.4641372 0.116471640
gene_6 0.17585301 16.70497 4.1366713 0.041963611
gene_7 -0.05396444 16.69328 0.3514909 0.553270396
gene_8 -0.15662395 16.69380 2.8394354 0.091976525
gene_9 -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511
我不知道为什么在 design
发生变化后仍然发生这种情况!如果有人有任何线索那将是惊人的,因为这已经让我头疼了一段时间!或者,如果有人有不同的方法来翻转 logFC,因此它与案例样本而不是控件相关(即确保控件样本被视为 GLM 中的截距),那就太好了。请注意,我知道我可以只交换结果 table 中的符号,但这是我真正想避免的事情,并且更愿意了解我上面的代码出了什么问题。
最后,声明一下,我认为我的问题不是 edgeR
特有的,而是使用 GLM 进行差异分析的一般问题。从根本上说,我只想知道如何使用 GLM 和设计矩阵交换截距系数。为清楚起见,我还将此发布到特定生物分析社区网站 Biostars:https://www.biostars.org/p/9469339/
会话信息:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] gridExtra_2.3 reshape2_1.4.4 data.table_1.14.0 Hmisc_4.5-0
[5] Formula_1.2-4 survival_3.2-9 lattice_0.20-38 ggrepel_0.9.1
[9] viridis_0.6.0 viridisLite_0.4.0 cowplot_1.1.1 ggplot2_3.3.3
[13] qs_0.24.1 edgeR_3.28.1 limma_3.42.2 purrr_0.3.4
[17] magrittr_2.0.1 dplyr_1.0.6 SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1
[21] DelayedArray_0.12.3 BiocParallel_1.20.1 matrixStats_0.58.0 Biobase_2.46.0
[25] biomaRt_2.42.1 BSgenome_1.54.0 rtracklayer_1.46.0 Biostrings_2.54.0
[29] XVector_0.26.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2
[33] S4Vectors_0.24.4 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-1 ellipsis_0.3.2 htmlTable_2.1.0 base64enc_0.1-3
[5] rstudioapi_0.13 listenv_0.8.0 bit64_4.0.5 AnnotationDbi_1.48.0
[9] fansi_0.4.2 codetools_0.2-16 splines_3.6.0 cachem_1.0.4
[13] knitr_1.33 Rsamtools_2.2.3 cluster_2.0.8 dbplyr_2.1.1
[17] png_0.1-7 sctransform_0.3.2 BiocManager_1.30.12 compiler_3.6.0
[21] httr_1.4.2 backports_1.2.1 assertthat_0.2.1 Matrix_1.2-17
[25] fastmap_1.1.0 cli_2.5.0 htmltools_0.5.1.1 prettyunits_1.1.1
[29] tools_3.6.0 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.2
[33] rappdirs_0.3.3 Rcpp_1.0.6 vctrs_0.3.7 xfun_0.22
[37] stringr_1.4.0 globals_0.14.0 lifecycle_1.0.0 pacman_0.5.1
[41] XML_3.99-0.3 future_1.21.0 zlibbioc_1.32.0 MASS_7.3-51.4
[45] scales_1.1.1 hms_1.0.0 RColorBrewer_1.1-2 yaml_2.2.1
[49] curl_4.3.1 memoise_2.0.0 rpart_4.1-15 latticeExtra_0.6-29
[53] stringi_1.5.3 RSQLite_2.2.4 checkmate_2.0.0 rlang_0.4.11
[57] pkgconfig_2.0.3 bitops_1.0-7 evaluate_0.14 GenomicAlignments_1.22.1
[61] htmlwidgets_1.5.3 bit_4.0.4 tidyselect_1.1.1 parallelly_1.25.0
[65] plyr_1.8.6 R6_2.5.0 generics_0.1.0 DBI_1.1.1
[69] pillar_1.6.0 foreign_0.8-71 withr_2.4.2 RCurl_1.98-1.3
[73] nnet_7.3-12 tibble_3.1.1 future.apply_1.7.0 crayon_1.4.1
[77] utf8_1.2.1 BiocFileCache_1.10.2 RApiSerialize_0.1.0 rmarkdown_2.7
[81] jpeg_0.1-8.1 progress_1.2.2 locfit_1.5-9.4 grid_3.6.0
[85] blob_1.2.1 infotheo_1.2.0 digest_0.6.27 openssl_1.4.4
[89] RcppParallel_5.0.3 munsell_0.5.0 stringfish_0.15.0 askpass_1.1
您正在重命名因子水平,而不是重新调整因子水平。要解决此问题,请尝试:
targets$case_control <- relevel(targets$case_control, "Control")
而不是
levels(targets$case_control) <- sort(levels(targets$case_control), decreasing=TRUE)
(并观察每种情况下对 targets$case_control
的影响)。