How can I plot box-plots for principal components 1, 2 and 3 for three different groups?

Question

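I ran a principal component analysis (PCA) and produced a plot of PC1 against PC2. It shows the variation in the expression of roughly 14 genes across three disease groups: 0 (control), 1 (ulcerative colitis) and 2 (Crohn's disease).

I would like to draw a boxplot for each group for each of the first three principal components, giving 9 boxplots in total.

The data matrix, before the PCA, has row names corresponding to the numbers 0, 1 or 2. The columns represent different genes (with the corresponding gene expression values).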
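Because the group is encoded only in the row names, it can help to turn those codes into a labelled factor before plotting. The following is a minimal sketch, not part of the original code: `sample_ids` stands in for `rownames(data.mat.1)` and reuses the six row names from the snapshot, and the labels are an assumption based on the group description above.

```r
# Stand-in for rownames(data.mat.1); same six codes as in the snapshot.
sample_ids <- c("1", "1", "0", "0", "2", "2")

# Map the numeric codes to readable group labels (labels are assumed here).
group <- factor(sample_ids,
                levels = c("0", "1", "2"),
                labels = c("Control", "UC", "Crohns"))

# Count samples per group; boxplots are only informative with several
# samples in each group.
group_counts <- table(group)
print(group_counts)
```

A factor with explicit levels also fixes the order in which the groups appear on the plot axis.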

I used prcomp to compute the PCA (log-transformed, centred and scaled).

Here is a snapshot of my matrix before the PCA:

    structure(c(9.11655423831332, 10.489164314825, 1.91402056531454, 
    7.15827328042159, 4.24137583841638, 8.27769344002199, 8.56104058610663, 
    10.4808234419919, 2.90978833628418, 6.23818256006594, 5.22964773531333, 
    10.7708328724305, 7.29461400089235, 11.8318994425553, 3.03424662623575, 
    8.01272738639518, 4.99017087770597, 11.5985078491858, 7.81888257764922, 
    11.9022935347989, 1.27378277405718, 7.22371591364402, 5.35032777682152, 
    11.3245694322554, 7.53493825433311, 12.3702117577478, 2.28591365299837, 
    6.3684670711928, 4.79325114470697, 11.2368359301193, 7.42400102411584, 
    10.4893608659259, 2.29357094839174, 7.39880980207098, 4.06127337845416, 
    10.064874404576, 8.23639009062635, 12.041628287702, 1.68881444318413, 
    6.83433748681479, 4.58216981866268, 10.7369117797388, 8.52022902181642, 
    11.8310518930764, 1.09698581801487, 7.01560705946119, 4.42096319700341, 
    9.55024900954538, 6.78397242802669, 10.7346656491963, 1.8562428132184, 
    6.79381714159694, 4.76311785326908, 9.2896578696716, 7.38261637784709, 
    11.8956476271189, 0.676793904156995, 7.12068629785535, 4.50969591112091, 
    10.3965680730289, 7.76024460081224, 11.4191374294463, 2.51273901194187, 
    6.49764372886188, 5.95216200154652, 8.80877686581081, 7.92745512232284, 
    9.64936710370214, 2.75037060332872, 8.32919606967059, 5.13312284319216, 
    10.0205608136955, 8.32640003009823, 10.7914139100956, 3.07554840032925, 
    7.71871340592007, 5.75595649315905, 9.71791978048218, 7.13284940508783, 
    10.9113426747693, 1.07350504928193, 6.56249247218448, 5.35574874951741, 
    9.54833175767732), .Dim = c(6L, 14L), .Dimnames = list(c("1", 
    "1", "0", "0", "2", "2"), c("Gene1", "Gene2", "Gene3", "Gene4", 
    "Gene5", "Gene6", "Gene7", "Gene8", "Gene9", "Gene10", "Gene11", 
    "Gene12", "Gene13", "Gene14")))

Update: the second question has been removed.

The code for the PCA plot is as follows:

    # PCA on log-transformed data, centred and scaled per gene
    data.mat.1.pca <- prcomp(log(data.mat.1), scale. = TRUE, center = TRUE)

    pcvalues <- summary(data.mat.1.pca)

    # Colour-code each disease group by row name
    colour_disease <- rownames(data.mat.1)

    # Exact matching avoids the partial matches that grep() would allow
    position_control <- which(colour_disease == "0")
    position_UC      <- which(colour_disease == "1")
    position_Crohn   <- which(colour_disease == "2")

    disease <- character(length(colour_disease))
    disease[position_control] <- "lightskyblue"
    disease[position_UC]      <- "lightslategrey"
    disease[position_Crohn]   <- "lightpink2"

    # Proportion of variance explained by PC1 and PC2, for the axis labels
    eigs <- data.mat.1.pca$sdev^2
    varExplained.pc1 <- round(eigs[1] / sum(eigs), digits = 3) * 100
    varExplained.pc2 <- round(eigs[2] / sum(eigs), digits = 3) * 100

    plot(data.mat.1.pca$x[, 1], data.mat.1.pca$x[, 2],
         col = disease, bg = disease, pch = 19, cex = 1,
         xlab = paste("PCA 1 (", varExplained.pc1, "%)", sep = ""),
         ylab = paste("PCA 2 (", varExplained.pc2, "%)", sep = ""))
    legend("bottomright", legend = c("Control", "UC", "Crohns"),
           fill = c("lightskyblue", "lightslategrey", "lightpink2"))

The values for the first three PCs are as follows:

                              PC1       PC2       PC3
    Standard deviation        3.6619    0.44801   0.30046
    Proportion of Variance    0.9578    0.01424   0.00645
    Cumulative Proportion     0.9578    0.97215   0.97860
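A table like this can be read directly from the `importance` element of `summary()` on a `prcomp` object. The sketch below uses simulated data, since the full matrix is not shown, so the variable names `x`, `pca` and `imp` are illustrative assumptions and the numbers will differ from those above.

```r
# Simulated stand-in data: 10 samples x 5 variables.
set.seed(42)
x <- matrix(rnorm(10 * 5), nrow = 10)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Rows: standard deviation, proportion of variance, cumulative proportion.
# Columns: one per principal component; keep the first three.
imp <- summary(pca)$importance[, 1:3]
print(imp)

# The same proportions computed by hand from the component variances.
prop <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop[1:3], 5))
```

Computing the proportions from `sdev^2` is the same calculation as the `eigs` lines in the plotting code, which makes it a convenient cross-check.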

Here is a link to a figure from a research paper: https://www.researchgate.net/figure/Boxplots-of-the-first-three-principal-components-of-the-kidney-data-Group-specific_fig1_316641179

They compare a control group with a treatment group, whereas I need three boxplots (one per group).

Or this one: https://www.researchgate.net/figure/Three-dimensional-principal-component-analysis-PCA-and-b-boxplots-of-principal_fig4_307533060

Answer 1

It is unusual to plot component scores like this, but try the following to get a point plot for the combinations you mentioned:

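    library(dplyr)    # provides %>%
    library(tidyr)    # pivot_longer()
    library(ggplot2)  # ggplot(), geom_point(), geom_boxplot()

    # One row per sample: group label plus the scores on PC1 to PC3
    df <- data.frame(disease = rownames(data.mat.1), data.mat.1.pca$x[, 1:3])

    df %>% pivot_longer(-disease) %>%
      ggplot(aes(x = name, col = disease, y = value)) +
      geom_point(position = position_jitterdodge())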

I hope you have more than 2 samples per group, unlike in your example. Adding the boxplots is straightforward:

    df %>% pivot_longer(-disease) %>%
      ggplot(aes(x = name, col = disease, y = value)) +
      geom_point(position = position_jitterdodge()) +
      geom_boxplot(alpha = 0.7)
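If a layout closer to the linked figures is wanted, with one panel per component and one box per group, base R graphics can also do it. This is only a sketch under assumptions: the data are simulated (8 samples per group, 14 variables), and `expr`, `group` and `scores` are illustrative names rather than objects from the question.

```r
# Simulated stand-in data: 3 groups x 8 samples, 14 "genes".
set.seed(123)
group <- factor(rep(c("Control", "UC", "Crohns"), each = 8),
                levels = c("Control", "UC", "Crohns"))
expr <- matrix(rnorm(24 * 14), nrow = 24,
               dimnames = list(NULL, paste0("Gene", 1:14)))

pca <- prcomp(expr, center = TRUE, scale. = TRUE)
scores <- data.frame(group = group, pca$x[, 1:3])

# One panel per principal component, one box per group: 9 boxes in total.
group_cols <- c("lightskyblue", "lightslategrey", "lightpink2")
old_par <- par(mfrow = c(1, 3))
for (pc in c("PC1", "PC2", "PC3")) {
  boxplot(split(scores[[pc]], scores$group),
          col = group_cols, xlab = "Group", ylab = "Score", main = pc)
}
par(old_par)  # restore the previous plotting layout
```

With the real data, `scores` would be built from `data.mat.1.pca$x[, 1:3]` and a group factor derived from the row names.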


r

pca

boxplot