Visualise in R with ggplot, a k-means clustered developmental gene expression dataset

Question

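I can see plenty of posts on this topic, but none of them addresses this question. Apologies if I have missed a relevant answer. I have a large protein expression dataset with samples in the columns, like this: rep1_0hr, rep1_16hr, rep1_24hr, rep1_48hr, rep1_72hr .....

and more than 2000 proteins. In other words, each sample is a different developmental timepoint.

In case it is of interest, the original dataset is 'mulvey2015' from the pRolocdata package in R, which I converted to a SummarizedExperiment object in RStudio.

I first ran k-means clustering on the data (the assay() of the SummarizedExperiment dataset) to obtain 12 clusters: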

k_mul <- kmeans(scale(assay(mul)), centers = 12, nstart = 10)

Then:

summary(k_mul)

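produced the expected output.

I would like the visualisation to look like this, with samples on the x-axis and expression on the y-axis. These plots look as if they were generated with facet_wrap() in ggplot:

For ggplot, the data need to be supplied as a data frame with a column holding each protein's cluster identity. The data also need to be in long format. I tried pivoting (pivot_longer) the original dataset, but of course there are a great many data points. In addition, the image I posted shows that, in any one plot, the number of coloured lines is smaller than the total number of proteins, which suggests the dataset may have been dimensionally reduced first, but I am not sure. So far I have been running the kmeans algorithm without dimensionality reduction. Could I get some guidance on how to make this plot?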

Answer 1

Here is my attempt at reverse-engineering the plot:

library(pRolocdata)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)

# Load the dataset from pRolocdata into the workspace
data(mulvey2015)

mulvey2015 %>%
  Biobase::assayData() %>%
  magrittr::extract2("exprs") %>%
  data.frame(check.names = FALSE) %>%
  tibble::rownames_to_column("prot_id") %>%
  mutate(.,
         cl = kmeans(select(., -prot_id),
                     centers = 12,
                     nstart = 10) %>%
           magrittr::extract2("cluster") %>%
           as.factor()) %>%
  pivot_longer(cols = !c(prot_id, cl),
               names_to = "Timepoint",
               values_to = "Expression") %>%
  ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
  geom_line(aes(group = prot_id)) +
  facet_wrap(~ cl, ncol = 4)
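
For readers without pRolocdata installed, the same reshaping steps can be tried on simulated data. This is only a minimal sketch: the matrix `expr`, the protein names, the timepoint labels and the choice of 3 clusters are all assumptions made up for illustration, not part of the mulvey2015 dataset.

```r
library(tidyr)
library(tibble)
library(ggplot2)

set.seed(1)  # k-means starts from random centres

# Hypothetical stand-in for the expression matrix: 60 proteins x 5 timepoints
timepoints <- c("0hr", "16hr", "24hr", "48hr", "72hr")
expr <- matrix(rnorm(60 * 5), nrow = 60,
               dimnames = list(paste0("prot", 1:60), timepoints))

# Cluster the proteins (rows); 3 clusters keeps the toy plot readable
km <- kmeans(expr, centers = 3, nstart = 10)

# Wide data frame: one row per protein, plus its cluster label
wide <- data.frame(expr, check.names = FALSE)
wide <- rownames_to_column(wide, "prot_id")
wide$cl <- factor(km$cluster)

# Long format: one row per protein-timepoint pair
long <- pivot_longer(wide,
                     cols = !c(prot_id, cl),
                     names_to = "Timepoint",
                     values_to = "Expression")

# One line per protein, one panel per cluster
p <- ggplot(long, aes(x = Timepoint, y = Expression, colour = cl)) +
  geom_line(aes(group = prot_id)) +
  facet_wrap(~ cl)
```

Printing `p` draws the faceted plot. The long table has one row per protein and timepoint, so 60 proteins over 5 timepoints give 300 rows.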

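As for your question, pivot_longer generally performs well, unless it cannot find unique combinations in the keys or runs into problems with data type conversion. The plot can be improved by:

adjusting the alpha parameter of geom_line (e.g. alpha = 0.5), to give an idea of the density of the lines
... for Timepoint
changing the orientation of axis.text.x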
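
The alpha and axis text suggestions above might look roughly like this. It is a sketch on a small made-up long-format table (the names `long`, `prot_id`, `cl` and the sample labels are assumptions for illustration), and the 90 degree rotation is just one possible orientation.

```r
library(ggplot2)

# Hypothetical long-format data: 6 proteins in 2 clusters over 3 timepoints
long <- expand.grid(prot_id = paste0("prot", 1:6),
                    Timepoint = c("rep1_0hr", "rep1_16hr", "rep1_24hr"),
                    stringsAsFactors = FALSE)
long$cl <- factor(ifelse(long$prot_id %in% paste0("prot", 1:3), 1, 2))
set.seed(42)
long$Expression <- rnorm(nrow(long))

p <- ggplot(long, aes(x = Timepoint, y = Expression, colour = cl)) +
  # Semi-transparent lines: overlapping lines appear darker, hinting at density
  geom_line(aes(group = prot_id), alpha = 0.5) +
  facet_wrap(~ cl, ncol = 2) +
  # Rotate the x-axis labels so long sample names do not overlap
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```

With a few thousand proteins a lower alpha may work better; the value is worth tuning by eye.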

Answer 2

Here is my own solution, which is very similar to the one above.

dfsa_mul <- data.frame(scale(assay(mul)))
dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")

Add the kmeans $cluster column to the dfsa_mul2 data frame. Only convert clus to a factor after running pivot_longer.

# k_mul is the kmeans() result computed in the question
dfsa_mul2$clus <- k_mul$cluster
dfsa_mul2 %>%
  pivot_longer(cols = -c("protID", "clus"),
               names_to = "samples",
               values_to = "expression") %>%
  ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
  geom_line(aes(group = protID)) +
  facet_wrap(~ factor(clus))

This produces a set of plots identical to the ones posted by @sbarbit.
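
On the point raised in the question about each panel showing fewer lines than there are proteins: each facet only draws the proteins assigned to that cluster, so the number of lines per panel is the cluster size. A quick base R check on simulated data (the matrix `expr` and the choice of 4 clusters are assumptions for illustration only):

```r
set.seed(123)

# Hypothetical expression matrix: 100 proteins x 5 timepoints
expr <- matrix(rnorm(100 * 5), nrow = 100,
               dimnames = list(paste0("prot", 1:100),
                               c("0hr", "16hr", "24hr", "48hr", "72hr")))

# scale() standardises each column (sample), as in the question
km <- kmeans(scale(expr), centers = 4, nstart = 10)

# Number of proteins per cluster, i.e. lines per facet
sizes <- table(km$cluster)
print(sizes)

# km$size holds the same counts, and together they cover every protein
stopifnot(all(as.integer(sizes) == km$size))
stopifnot(sum(sizes) == nrow(expr))
```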

r

ggplot2

k-means