列宽的小提琴图ggplot2

violin plot ggplot2 with width from column

我是 R 的新手,我只将它用于可视化,所以我可能会遗漏一些简单的东西。

只是我想要的是,我有两列应该是 x 和 y 轴。我的第三列应该定义图形的宽度。尽管我从不同的答案中尝试了很多东西,但我对代码并没有走得太远。假设我在代码中到目前为止:

ggplot(disM, aes(x=study, y=value)) +
  geom_violin() +
  labs(list(title="Distribution", x="Studies", y="Ranges"))

这并没有真正取得任何成就。

我有一个 table 这样的:

  Col0         study     value
1    30-31 breast cancer    357263
2    32-33 breast cancer    352067
3    34-35 breast cancer    340264
4    36-37 breast cancer    309827
5    38-39 breast cancer    298684
6    40-41 breast cancer    322570
7    42-43 breast cancer    338480
8    44-45 breast cancer    354451
9    46-47 breast cancer    429183
10   48-49 breast cancer    396942
11   50-51 breast cancer    415195
12   52-53 breast cancer    368217
13   54-55 breast cancer    445884
14   56-57 breast cancer    395652
15   58-59 breast cancer    386643
16   60-61 breast cancer    461940
17   62-63 breast cancer    473772
18   64-65 breast cancer    464228
19   66-67 breast cancer    485851
20   68-69 breast cancer    513411
21   70-71 breast cancer    576618
22   72-73 breast cancer    588724
23   74-75 breast cancer    634343
24   76-77 breast cancer    584662
25   78-79 breast cancer    608901
26   80-81 breast cancer    617286
27   82-83 breast cancer    659318
28   84-85 breast cancer    757167
29   86-87 breast cancer   1044465
30   88-89 breast cancer    982901
31   90-91 breast cancer   1114269
32   92-93 breast cancer   1110257
33   94-95 breast cancer   1742966
34   96-97 breast cancer   6379974
35   98-99 breast cancer   3437746
36 100-101 breast cancer 118984063
37   30-31  renal cancer   1055566
38   32-33  renal cancer   1089405
39   34-35  renal cancer   1228087
40   36-37  renal cancer   1265606
41   38-39  renal cancer   1264919
42   40-41  renal cancer   1248949
43   42-43  renal cancer   1391738
44   44-45  renal cancer   1453100
45   46-47  renal cancer   1443915
46   48-49  renal cancer   1429785
47   50-51  renal cancer   1372041
48   52-53  renal cancer   1339706
49   54-55  renal cancer   1418135
50   56-57  renal cancer   1484162
51   58-59  renal cancer   1582617
52   60-61  renal cancer   1571977
53   62-63  renal cancer   1652503
54   64-65  renal cancer   1742230
55   66-67  renal cancer   1859936
56   68-69  renal cancer   1928028
57   70-71  renal cancer   2041783
58   72-73  renal cancer   2108994
59   74-75  renal cancer   2154244
60   76-77  renal cancer   2218430
61   78-79  renal cancer   2333206
62   80-81  renal cancer   2377262
63   82-83  renal cancer   2345651
64   84-85  renal cancer   2402114
65   86-87  renal cancer   2519284
66   88-89  renal cancer   2542761
67   90-91  renal cancer   2587606
68   92-93  renal cancer   2308279
69   94-95  renal cancer   2980927
70   96-97  renal cancer  14108950
71   98-99  renal cancer   2762116
72 100-101  renal cancer 211513230

X 轴应为研究列,y 轴应为 Col0,小提琴图的宽度应为值列。我不能拆分 col0,因为我只有一个范围的数据。

任何关于检查内容的指针,如何做到这一点将不胜感激。对不起,如果我错过了类似的问题。

提前致谢

我来猜一猜。 (如果我是对的,您还可以查找有关 pyramid plots 的信息。)

重新排序标签,使“100-101”真正排在最后:

disM$Col0 <- factor(disM$Col0,levels=unique(disM$Col0))

重新排列以便更容易绘制多边形(我希望有更简单的方法来做到这一点,但想不出一个):

library(plyr)
disM2 <- ddply(disM,"study",
   function(dd) with(dd,
             data.frame(y=c(as.numeric(Col0),rev(as.numeric(Col0))),
                        x=c(-value/2,rev(value/2)))))


library(ggplot2); theme_set(theme_bw())
ggplot(disM2)+
    geom_polygon(aes(x,y),alpha=0.5)+
    facet_wrap(~study)+
    labs(list(title="Distribution"))+
    scale_y_continuous(breaks=as.numeric(disM$Col0),
                       labels=disM$Col0)+
    scale_x_continuous(labels=NULL)

另一种取法如下。

首先你应该计算你的 classes 标记(class 最大值 + class 最小值 / 2)。在您的情况下,它是间隔的中点(您可以使用 levels(x = my_data$col0) 提取)。

my_data$class_mark <- rep(x = seq(from = 30.5, to = 100.5, by = 2), times = 2)

那么你应该拆分你的数据:

my_data_br <- my_data[which(my_data$study == "breast cancer"),]
my_data_re <- my_data[which(my_data$study == "renal cancer"),]

您的 value 列的数字非常大,因此您应该将它们除以最小值:

my_data_br$value <- my_data_br$value/min(my_data_br$value)
my_data_re$value <- my_data_re$value/min(my_data_re$value)

之后,您应该重复每个 class 标记的次数与其 value 的次数相同。

classmark_rep_br <- rep(x = my_data_br[,4], times = my_data_br[,3])
br_rep <- rep("breast cancer", times = length(x = classmark_rep_br))
br_data <- cbind.data.frame(br_rep, classmark_rep_br)
names(br_data) <- c("study", "value")

classmark_rep_re <- rep(x = my_data_re[,4], times = my_data_re[,3])
re_rep <- rep("renal cancer", times = length(x = classmark_rep_re))
re_data <- cbind.data.frame(re_rep, classmark_rep_re)
names(re_data) <- c("study", "value")

最后创建你的新数据库:

my_data2 <- rbind.data.frame(br_data, re_data)

现在您可以根据自己的喜好制作漂亮的图表(例如 the following one)并保存:

my_graph <- ggplot(data = my_data2, aes(x = study, y = value, fill = study)) + geom_violin() + 
  theme(legend.position = "none", panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text = element_text(size = 10, face = "bold"), panel.background = element_blank(), 
        axis.line = element_line(colour = "black")) + 
  labs(x = "", y = "") + scale_fill_brewer(palette="Pastel1") + 
  scale_x_discrete(labels = c("Breast cancer", "Renal cancer")) + 
  scale_y_continuous(breaks = c(30.5, 40.5, 50.5, 60.5, 70.5, 80.5, 90.5, 100.5),
                     labels = c("30-31", "40-41", "50-51", "60-61", "70-71", "80-81", "90-91", "100-101"))

    ggsave(plot = my_graph, filename = "my_graph", path = "output/path/", device = "png", dpi = 200)