Boxplot of disease vs control genes in R
I'm new to R, so I don't know much about this. I have data for 64 samples (AML disease vs. control) with 22283 gene expression values per sample. The data looks like this.
GSM239170 | GSM239323 | GSM239324 | GSM239326 | GSM239328 | GSM239329 | GSM239331 | GSM239332 | GSM239333 |
---|---|---|---|---|---|---|---|---|
3.016704177 | 3.285669072 | 2.929482692 | 2.922820483 | 3.15950317 | 3.163327169 | 2.985901308 | 3.122708843 | 3.070948463 |
7.977735461 | 6.532514237 | 6.388007183 | 6.466679556 | 6.432795021 | 6.407321524 | 6.426470803 | 6.376394357 | 6.469070308 |
4.207280707 | 4.994965767 | 4.40159671 | 4.747114589 | 4.830045513 | 4.213762092 | 4.884418365 | 4.4318876 | 4.849665444 |
7.25609471 | 7.420807337 | 6.999340125 | 7.094488581 | 7.024332721 | 7.17928981 | 7.159898654 | 7.009977785 | 6.830979234 |
2.204955099 | 2.331625217 | 2.133305231 | 2.18332885 | 2.12778313 | 2.269697813 | 2.264705552 | 2.253940441 | 2.287924323 |
7.28437278 | 6.983593721 | 6.86337111 | 6.865970678 | 7.219840938 | 7.181113053 | 7.392230178 | 7.484052914 | 7.52498281 |
4.265792764 | 4.970684112 | 4.595545125 | 4.575545289 | 4.547957809 | 4.68215122 | 4.674495889 | 4.675841709 | 4.643311767 |
2.6943516 | 2.916324936 | 2.578130269 | 2.659717988 | 2.567436676 | 2.8095128 | 2.790110381 | 2.795882913 | 2.884588792 |
3.646303109 | 8.817891552 | 11.4248793 | 10.74738082 | 9.296043108 | 9.53150669 | 8.285160496 | 9.769919327 | 9.774610531 |
3.040292001 | 3.38486713 | 2.958851115 | 3.047880699 | 2.878562717 | 3.209319974 | 3.20260379 | 3.195993624 | 3.3004227 |
2.357625231 | 2.444753172 | 2.340767158 | 2.32143889 | 2.282608342 | 2.401218719 | 2.385568421 | 2.375334953 | 2.432634747 |
5.378494673 | 6.065038394 | 5.134842087 | 5.367342376 | 5.682051149 | 5.712072512 | 5.57179966 | 5.72082395 | 5.656674512 |
2.833814735 | 3.038434511 | 2.837711812 | 2.859800224 | 2.866040813 | 2.969167906 | 2.929449968 | 2.963530689 | 2.931065261 |
6.192932281 | 6.478439634 | 6.180169144 | 6.151689376 | 6.238949956 | 6.708196123 | 6.441437631 | 6.448280595 | 6.413562269 |
4.543042482 | 4.786227217 | 4.445131477 | 4.51471011 | 4.491645167 | 4.460114204 | 4.602482637 | 4.587221948 | 4.623125028 |
6.069437462 | 6.232738284 | 6.74644117 | 7.04995802 | 6.938928532 | 6.348253102 | 6.080950712 | 6.324619355 | 6.472893789 |
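Here (GSM239170, GSM239323, GSM239324, GSM239332, GSM239333) are the AML samples and (GSM239326, GSM239328, GSM239329, GSM239331) are the control samples. I want to draw a gene expression boxplot for all samples, with the AML samples marked in red and the control samples in blue.
I tried the code below, but it gave an error.
boxplot(df1, main = "Boxplot")
Error in x[floor(d)] + x[ceiling(d)] : non-numeric argument to binary operator
With this code,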
meltData <- melt(df1)
boxplot(meltData, main = "Boxplot")
# Error
meltData <- melt(df1)
Using Probe_ID as id variables
boxplot(meltData, main = "Boxplot")
Error in x[floor(d)] + x[ceiling(d)] :
non-numeric argument to binary operator
How can I make the boxplot?
Thanks.
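You can do this by pivoting the data to long format, recoding the group into a new variable, and then plotting. I'm using the tidyverse collection of packages (tidyr, dplyr, ggplot2, etc.) for this example.
First we load the packages, and I create a temporary dataset for the example. You can apply the same steps to your own data.
Suppose I have a dataset with four columns.
library(tidyverse)
tmp <- data.frame(col1 = runif(10),
                  col2 = runif(10),
                  col3 = runif(10),
                  col4 = runif(10))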
tmp
#> col1 col2 col3 col4
#> 1 0.90734014 0.5982927 0.8737742 0.4736948
#> 2 0.86152607 0.6449016 0.8081259 0.4717687
#> 3 0.82432693 0.1984189 0.8673977 0.2029126
#> 4 0.85152294 0.2304586 0.6021387 0.3914528
#> 5 0.38723829 0.5963829 0.6431011 0.8964814
#> 6 0.45445156 0.2750627 0.8670836 0.3399746
#> 7 0.38512391 0.5757645 0.9060044 0.1580425
#> 8 0.58636619 0.4287894 0.7827405 0.6647625
#> 9 0.50576266 0.6141626 0.5411717 0.4168770
#> 10 0.06328743 0.1455354 0.7490322 0.9065379
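I know that col1 and col2 are controls, so I need to pivot all the columns and create a type column that identifies the control and treatment groups. I create the control_cols vector beforehand to keep the code cleaner. If you are new to R, I think the quickest way is to type out the 32 names by hand. If you happen to know regular expressions, you can take advantage of them by using str_detect() on the colname column.
control_cols <- c("col1", "col2")
tmp_transformed <- tmp %>%
  pivot_longer(everything(), names_to = "colname", values_to = "value") %>%
  mutate(type = if_else(colname %in% control_cols, "control", "treatment"))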
tmp_transformed
#> # A tibble: 40 x 3
#> colname value type
#> <chr> <dbl> <chr>
#> 1 col1 0.907 control
#> 2 col2 0.598 control
#> 3 col3 0.874 treatment
#> 4 col4 0.474 treatment
#> 5 col1 0.862 control
#> 6 col2 0.645 control
#> 7 col3 0.808 treatment
#> 8 col4 0.472 treatment
#> 9 col1 0.824 control
#> 10 col2 0.198 control
#> # ... with 30 more rows
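The str_detect() variant mentioned above could look like the sketch below. The pattern "^col[12]$" and the names toy and toy_long are assumptions that only fit this toy data. A pattern like this helps only when the group is actually encoded in the column names, which does not appear to be the case for the GSM sample IDs in the question.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data with the same column names as the example above.
toy <- data.frame(col1 = runif(10), col2 = runif(10),
                  col3 = runif(10), col4 = runif(10))

toy_long <- toy %>%
  pivot_longer(everything(), names_to = "colname", values_to = "value") %>%
  # Columns whose name matches the pattern are treated as controls.
  mutate(type = if_else(str_detect(colname, "^col[12]$"), "control", "treatment"))

# Check the recoding: each column name should map to exactly one type.
distinct(toy_long, colname, type)
```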
Once the data is ready, I can create a boxplot in which the fill color is mapped to the group type. You can specify the colors manually in scale_fill_manual().
Control vs. treatment:
tmp_transformed %>%
  ggplot(aes(type, value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cyan", "red"))
By colname and type:
tmp_transformed %>%
  ggplot(aes(colname, value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cyan", "red"))
Created on 2022-01-20 by the reprex package (v2.0.1)
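Applied to the question's data, the same approach might look like the following sketch. It assumes a data frame shaped like df1 from the question; only three samples with made-up values are built here, under the hypothetical name expr_small. A named color vector is used so that red and blue are tied to the group labels rather than to their alphabetical order.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up stand-in for df1: one column per sample, one row per probe.
expr_small <- data.frame(
  GSM239170 = c(3.0, 8.0, 4.2, 7.3),
  GSM239326 = c(2.9, 6.5, 4.7, 7.1),
  GSM239332 = c(3.1, 6.4, 4.4, 7.0)
)

# AML sample names as listed in the question; all other samples are controls.
aml <- c("GSM239170", "GSM239323", "GSM239324", "GSM239332", "GSM239333")

expr_long <- expr_small %>%
  pivot_longer(everything(), names_to = "sample_id", values_to = "value") %>%
  mutate(type = if_else(sample_id %in% aml, "AML", "control"))

# Named values map each group label to its color explicitly.
p <- ggplot(expr_long, aes(sample_id, value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = c(AML = "red", control = "blue"))
print(p)
```

If the real data still contains the Probe_ID column that melt() reported, pivoting with pivot_longer(-Probe_ID, ...) instead of everything() should keep it out of the reshaped values.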
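To color the AML and control samples red and blue respectively, match names(df1) against a vector of AML sample names and use the result to index a vector of color values.
meltData <- reshape2::melt(df1)
# Names of the AML samples, split on commas and stripped of whitespace
aml <- scan(text="GSM239170, GSM239323, GSM239324, GSM239332, GSM239333", what = character(), sep = ",")
aml <- trimws(aml)
# 1 for control columns, 2 for AML columns
i_aml <- (names(df1) %in% aml) + 1L
colors <- c("blue", "red")
boxplot(value ~ variable, meltData, main = "Boxplot", col = colors[i_aml])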
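The error in the question most likely comes from the non-numeric Probe_ID column that melt() reports as its id variable, since boxplot() on a data frame needs every column to be numeric. A minimal sketch, assuming the real data frame carries such a column (the small df_raw below and its values are made up for illustration), drops the non-numeric columns first and then colors each box by group:

```r
# Hypothetical miniature of the real data: a character Probe_ID column
# plus numeric expression columns (values are made up for illustration).
df_raw <- data.frame(
  Probe_ID  = c("p1", "p2", "p3", "p4"),
  GSM239170 = c(3.0, 8.0, 4.2, 7.3),
  GSM239326 = c(2.9, 6.5, 4.7, 7.1),
  GSM239332 = c(3.1, 6.4, 4.4, 7.0)
)

# Keep only the numeric columns; boxplot() cannot summarise character data.
is_num <- vapply(df_raw, is.numeric, logical(1))
expr <- df_raw[, is_num, drop = FALSE]

# AML sample names as given in the question; everything else is a control.
aml <- c("GSM239170", "GSM239323", "GSM239324", "GSM239332", "GSM239333")
box_cols <- ifelse(names(expr) %in% aml, "red", "blue")

# One box per sample, drawn in column order, so box_cols lines up with the boxes.
bp <- boxplot(expr, main = "Boxplot", col = box_cols, las = 2)
```

Because the colors are computed from names(expr) after the id column has been dropped, they stay aligned with the boxes even when the original data frame has extra non-numeric columns.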
Data
df1 <-
structure(list(GSM239170 = c(3.016704177, 7.977735461, 4.207280707,
7.25609471, 2.204955099, 7.28437278, 4.265792764, 2.6943516,
3.646303109, 3.040292001, 2.357625231, 5.378494673, 2.833814735,
6.192932281, 4.543042482, 6.069437462), GSM239323 = c(3.285669072,
6.532514237, 4.994965767, 7.420807337, 2.331625217, 6.983593721,
4.970684112, 2.916324936, 8.817891552, 3.38486713, 2.444753172,
6.065038394, 3.038434511, 6.478439634, 4.786227217, 6.232738284
), GSM239324 = c(2.929482692, 6.388007183, 4.40159671, 6.999340125,
2.133305231, 6.86337111, 4.595545125, 2.578130269, 11.4248793,
2.958851115, 2.340767158, 5.134842087, 2.837711812, 6.180169144,
4.445131477, 6.74644117), GSM239326 = c(2.922820483, 6.466679556,
4.747114589, 7.094488581, 2.18332885, 6.865970678, 4.575545289,
2.659717988, 10.74738082, 3.047880699, 2.32143889, 5.367342376,
2.859800224, 6.151689376, 4.51471011, 7.04995802), GSM239328 = c(3.15950317,
6.432795021, 4.830045513, 7.024332721, 2.12778313, 7.219840938,
4.547957809, 2.567436676, 9.296043108, 2.878562717, 2.282608342,
5.682051149, 2.866040813, 6.238949956, 4.491645167, 6.938928532
), GSM239329 = c(3.163327169, 6.407321524, 4.213762092, 7.17928981,
2.269697813, 7.181113053, 4.68215122, 2.8095128, 9.53150669,
3.209319974, 2.401218719, 5.712072512, 2.969167906, 6.708196123,
4.460114204, 6.348253102), GSM239331 = c(2.985901308, 6.426470803,
4.884418365, 7.159898654, 2.264705552, 7.392230178, 4.674495889,
2.790110381, 8.285160496, 3.20260379, 2.385568421, 5.57179966,
2.929449968, 6.441437631, 4.602482637, 6.080950712), GSM239332 = c(3.122708843,
6.376394357, 4.4318876, 7.009977785, 2.253940441, 7.484052914,
4.675841709, 2.795882913, 9.769919327, 3.195993624, 2.375334953,
5.72082395, 2.963530689, 6.448280595, 4.587221948, 6.324619355
), GSM239333 = c(3.070948463, 6.469070308, 4.849665444, 6.830979234,
2.287924323, 7.52498281, 4.643311767, 2.884588792, 9.774610531,
3.3004227, 2.432634747, 5.656674512, 2.931065261, 6.413562269,
4.623125028, 6.472893789)), class = "data.frame", row.names = c(NA, -16L))