R 中疾病与控制基因的箱线图

Question

R 的新手，所以对此一无所知。我有 64 个样本（AML 疾病与对照）的数据和每个样本的 22283 个基因表达值。数据看起来像这样。

GSM239170	GSM239323	GSM239324	GSM239326	GSM239328	GSM239329	GSM239331	GSM239332	GSM239333
3.016704177	3.285669072	2.929482692	2.922820483	3.15950317	3.163327169	2.985901308	3.122708843	3.070948463
7.977735461	6.532514237	6.388007183	6.466679556	6.432795021	6.407321524	6.426470803	6.376394357	6.469070308
4.207280707	4.994965767	4.40159671	4.747114589	4.830045513	4.213762092	4.884418365	4.4318876	4.849665444
7.25609471	7.420807337	6.999340125	7.094488581	7.024332721	7.17928981	7.159898654	7.009977785	6.830979234
2.204955099	2.331625217	2.133305231	2.18332885	2.12778313	2.269697813	2.264705552	2.253940441	2.287924323
7.28437278	6.983593721	6.86337111	6.865970678	7.219840938	7.181113053	7.392230178	7.484052914	7.52498281
4.265792764	4.970684112	4.595545125	4.575545289	4.547957809	4.68215122	4.674495889	4.675841709	4.643311767
2.6943516	2.916324936	2.578130269	2.659717988	2.567436676	2.8095128	2.790110381	2.795882913	2.884588792
3.646303109	8.817891552	11.4248793	10.74738082	9.296043108	9.53150669	8.285160496	9.769919327	9.774610531
3.040292001	3.38486713	2.958851115	3.047880699	2.878562717	3.209319974	3.20260379	3.195993624	3.3004227
2.357625231	2.444753172	2.340767158	2.32143889	2.282608342	2.401218719	2.385568421	2.375334953	2.432634747
5.378494673	6.065038394	5.134842087	5.367342376	5.682051149	5.712072512	5.57179966	5.72082395	5.656674512
2.833814735	3.038434511	2.837711812	2.859800224	2.866040813	2.969167906	2.929449968	2.963530689	2.931065261
6.192932281	6.478439634	6.180169144	6.151689376	6.238949956	6.708196123	6.441437631	6.448280595	6.413562269
4.543042482	4.786227217	4.445131477	4.51471011	4.491645167	4.460114204	4.602482637	4.587221948	4.623125028
6.069437462	6.232738284	6.74644117	7.04995802	6.938928532	6.348253102	6.080950712	6.324619355	6.472893789

其中 (GSM239170、GSM239323、GSM239324、GSM239332、GSM239333) 是 AML 样本并且 (GSM239326、GSM239328、GSM239329、GSM239331)是对照样品。我想为所有样本绘制基因表达箱线图，将 AML 样本的数据点标记为红色，对照样本的数据点标记为蓝色。

我尝试了下面的代码，但出现了错误。

boxplot(df1, main = "Boxplot")

Error in x[floor(d)] + x[ceiling(d)] : non-numeric argument to binary operator

使用这段代码，

meltData <- melt(df1)
boxplot(meltData, main = "Boxplot")
# Error
meltData <- melt(df1)
Using Probe_ID as id variables
boxplot(meltData, main = "Boxplot")
Error in x[floor(d)] + x[ceiling(d)] : 
  non-numeric argument to binary operator

如何制作箱线图？

谢谢。

Answer 1

您可以通过旋转数据、在新变量中重新编码并绘图来实现这一点。我正在使用 tidyverse 包集合（tidyr、dplyr、ggplot2 等）来做这个例子。

首先我们加载包，我正在为这个例子创建一个时态数据集。您可以将此应用于您自己的数据。

假设我有一个包含四列的数据集。

library(tidyverse)

tmp <- data.frame(col1 = runif(10),
           col2 = runif(10),
           col3 = runif(10),
           col4 = runif(10)) 

tmp
#>          col1      col2      col3      col4
#> 1  0.90734014 0.5982927 0.8737742 0.4736948
#> 2  0.86152607 0.6449016 0.8081259 0.4717687
#> 3  0.82432693 0.1984189 0.8673977 0.2029126
#> 4  0.85152294 0.2304586 0.6021387 0.3914528
#> 5  0.38723829 0.5963829 0.6431011 0.8964814
#> 6  0.45445156 0.2750627 0.8670836 0.3399746
#> 7  0.38512391 0.5757645 0.9060044 0.1580425
#> 8  0.58636619 0.4287894 0.7827405 0.6647625
#> 9  0.50576266 0.6141626 0.5411717 0.4168770
#> 10 0.06328743 0.1455354 0.7490322 0.9065379

我知道 col1 和 col2 是对照组，所以我需要旋转所有列并创建一个 type 列来识别对照组和治疗组。我之前创建了 control_cols 向量以获得更清晰的代码。我认为如果您是 R 的新手，最快的方法是手动指定 32 个名称。如果您碰巧知道正则表达式，您可以通过在 colnames 列中使用 str_detect() 来利用它。

control_cols <- c("col1", "col2")

tmp_transformed <- tmp %>% 
  pivot_longer(everything(), names_to = "colname", values_to = "value") %>% 
  mutate(type = if_else(colname %in% control_cols, "control", "treatment"))

tmp_transformed
#> # A tibble: 40 x 3
#>    colname value type     
#>    <chr>   <dbl> <chr>    
#>  1 col1    0.907 control  
#>  2 col2    0.598 control  
#>  3 col3    0.874 treatment
#>  4 col4    0.474 treatment
#>  5 col1    0.862 control  
#>  6 col2    0.645 control  
#>  7 col3    0.808 treatment
#>  8 col4    0.472 treatment
#>  9 col1    0.824 control  
#> 10 col2    0.198 control  
#> # ... with 30 more rows

一旦我的数据准备就绪，我现在可以创建一个箱线图，其中每个填充颜色都映射到一个组类型。您可以在 scale_fill_manual().

中手动指定颜色

对照与治疗：

tmp_transformed %>% 
  ggplot(aes(type, value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cyan", "red"))

对于 colname 和类型

tmp_transformed %>% 
  ggplot(aes(colname, value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cyan", "red"))

^{由 reprex package (v2.0.1)}

创建于 2022-01-20

Answer 2

要分别为 AML 和控件着色红色和蓝色，请将 names(df1) 与 AML 向量匹配并使用它来索引颜色值向量。

meltData <- reshape2::melt(df1)

aml <- scan(text="GSM239170, GSM239323, GSM239324, GSM239332, GSM239333", what = character(), sep = ",")
aml <- trimws(aml)
i_aml <- (names(df1) %in% aml) + 1L
colors <- c("blue", "red")

boxplot(value ~ variable, meltData, main = "Boxplot", col = colors[i_aml])

数据

df1 <-
structure(list(GSM239170 = c(3.016704177, 7.977735461, 4.207280707, 
7.25609471, 2.204955099, 7.28437278, 4.265792764, 2.6943516, 
3.646303109, 3.040292001, 2.357625231, 5.378494673, 2.833814735, 
6.192932281, 4.543042482, 6.069437462), GSM239323 = c(3.285669072, 
6.532514237, 4.994965767, 7.420807337, 2.331625217, 6.983593721, 
4.970684112, 2.916324936, 8.817891552, 3.38486713, 2.444753172, 
6.065038394, 3.038434511, 6.478439634, 4.786227217, 6.232738284
), GSM239324 = c(2.929482692, 6.388007183, 4.40159671, 6.999340125, 
2.133305231, 6.86337111, 4.595545125, 2.578130269, 11.4248793, 
2.958851115, 2.340767158, 5.134842087, 2.837711812, 6.180169144, 
4.445131477, 6.74644117), GSM239326 = c(2.922820483, 6.466679556, 
4.747114589, 7.094488581, 2.18332885, 6.865970678, 4.575545289, 
2.659717988, 10.74738082, 3.047880699, 2.32143889, 5.367342376, 
2.859800224, 6.151689376, 4.51471011, 7.04995802), GSM239328 = c(3.15950317, 
6.432795021, 4.830045513, 7.024332721, 2.12778313, 7.219840938, 
4.547957809, 2.567436676, 9.296043108, 2.878562717, 2.282608342, 
5.682051149, 2.866040813, 6.238949956, 4.491645167, 6.938928532
), GSM239329 = c(3.163327169, 6.407321524, 4.213762092, 7.17928981, 
2.269697813, 7.181113053, 4.68215122, 2.8095128, 9.53150669, 
3.209319974, 2.401218719, 5.712072512, 2.969167906, 6.708196123, 
4.460114204, 6.348253102), GSM239331 = c(2.985901308, 6.426470803, 
4.884418365, 7.159898654, 2.264705552, 7.392230178, 4.674495889, 
2.790110381, 8.285160496, 3.20260379, 2.385568421, 5.57179966, 
2.929449968, 6.441437631, 4.602482637, 6.080950712), GSM239332 = c(3.122708843, 
6.376394357, 4.4318876, 7.009977785, 2.253940441, 7.484052914, 
4.675841709, 2.795882913, 9.769919327, 3.195993624, 2.375334953, 
5.72082395, 2.963530689, 6.448280595, 4.587221948, 6.324619355
), GSM239333 = c(3.070948463, 6.469070308, 4.849665444, 6.830979234, 
2.287924323, 7.52498281, 4.643311767, 2.884588792, 9.774610531, 
3.3004227, 2.432634747, 5.656674512, 2.931065261, 6.413562269, 
4.623125028, 6.472893789)), class = "data.frame", row.names = c(NA, -16L))

R 中疾病与控制基因的箱线图

Boxplot of disease vs control genes in R

r

boxplot

数据