Visualizing clusters extracted from MClust using ggplot2

Question

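I am using mclust to analyze the distribution of my data (this is a follow-up question).
My data can be downloaded here: https://www.file-upload.net/download-14320392/example.csv.html

First, I evaluate the clusters present in the data: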

library(reshape2)
library(mclust)
library(ggplot2)

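# Pick the CSV file interactively and read it
data <- read.csv(file.choose(), header=TRUE,  check.names = FALSE)
# Reshape to long format, dropping missing values
data_melt <- melt(data, value.name = "value", na.rm=TRUE)

# Fit univariate equal-variance ("E") Gaussian mixtures with 1 to 7 components;
# Mclust keeps the model with the best BIC
fit <- Mclust(data$value, modelNames="E", G = 1:7)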
summary(fit, parameters = TRUE)

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust E (univariate, equal variance) model with 4 components: 

log-likelihood    n df       BIC       ICL
-20504.71 3258  8 -41074.13 -44326.69

Clustering table:
1    2    3    4 
0 2271  896   91 

Mixing probabilities:
1         2         3         4 
0.2807685 0.4342499 0.2544305 0.0305511 

Means:
1        2        3        4 
1381.391 1381.715 1574.335 1851.667 

Variances:
1        2        3        4 
7466.189 7466.189 7466.189 7466.189

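Now that the clusters are identified, I would like to overlay the total distribution with the distributions of the individual components. To do this, I tried to extract the assignment of each value to its respective cluster using: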

df <- as.data.frame(data)
df$classification <- as.factor(df$value[fit$classification])

ggplot(df, aes(value, fill= classification)) + 
  geom_density(aes(col=classification, fill = NULL), size = 1)

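As a result, I get the following:

It appears to work; however, I am wondering:
a) where do the labels of the individual classifications (1602, 1639 and 1823) come from?
b) how can I scale the individual densities as fractions of the total density (e.g. 1823 contributes only 91 of the 3258 observations; see above)?
c) would it make sense to use the predicted normal distributions, based on the obtained means + SD, instead?

Any help or suggestions are greatly appreciated!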

Answer 1

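I think you can get what you want with the following:

library(dplyr)  # provides mutate() and re-exports %>%
data_melt <- data_melt %>% mutate(class = as.factor(fit$classification))
# after_stat(count) is density * number of observations in the group,
# so each curve is scaled by its cluster size (replaces the deprecated ..count..)
ggplot(data_melt, aes(x=value, colour=class, fill=class)) + 
    geom_density(aes(y=after_stat(count)), alpha=.25)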
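
On (a): the labels 1602, 1639 and 1823 are most likely not cluster identifiers at all. `df$value[fit$classification]` uses the cluster labels as positions into `value`, so it returns whatever values happen to sit at positions 2, 3 and 4 of that column (no observation was assigned to cluster 1, which would explain why only three labels show up). A minimal sketch with made-up numbers; the two vectors below are toy stand-ins for `data$value` and `fit$classification`, not the real data:

```r
# Toy stand-ins (values are invented for illustration only)
value <- c(1400, 1602, 1639, 1823, 1390, 1580)
classification <- c(2, 3, 2, 4, 2, 3)  # cluster labels, as in fit$classification

# What the question's code does: labels are used as *positions* into value
wrong <- as.factor(value[classification])
levels(wrong)  # "1602" "1639" "1823": the values stored at positions 2, 3 and 4

# What was probably intended: the labels themselves are the grouping factor
right <- as.factor(classification)
levels(right)  # "2" "3" "4"
table(right)   # number of observations per cluster
```

Using the labels directly, as in the `mutate()` call above, gives one curve per cluster rather than one curve per accidentally selected value.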

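On (b) and (c): one alternative is to skip the kernel densities and draw the fitted normal components themselves, each weighted by its mixing probability. The sketch below is only an illustration under assumptions: the parameters are copied from the `summary(fit)` output in the question, and the names `pro`, `mu`, `sdev`, `components` and `mixture` are mine. With the fitted object at hand, the same numbers should be available under `fit$parameters` (`pro`, `mean`, `variance$sigmasq`).

```r
# Parameters copied from the summary(fit) output in the question
pro  <- c(0.2807685, 0.4342499, 0.2544305, 0.0305511)  # mixing probabilities
mu   <- c(1381.391, 1381.715, 1574.335, 1851.667)      # component means
sdev <- sqrt(7466.189)                                 # common SD (model "E")

# Evaluate each weighted component on a grid covering all components
x <- seq(min(mu) - 4 * sdev, max(mu) + 4 * sdev, length.out = 500)
components <- do.call(rbind, lapply(seq_along(pro), function(k) {
  data.frame(x = x, component = factor(k),
             density = pro[k] * dnorm(x, mean = mu[k], sd = sdev))
}))

# The weighted components add up to the overall mixture density
mixture <- aggregate(density ~ x, data = components, FUN = sum)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(components, aes(x, density)) +
    geom_line(aes(colour = component)) +
    geom_line(data = mixture, linetype = "dashed")
  print(p)
}
```

Each weighted curve integrates to its mixing probability, so the curves are fractions of the total density by construction. Note that these weights need not match the counts in the clustering table: components 1 and 2 have almost identical means, so component 1 carries a weight of about 0.28 even though no observation is classified into it.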
Tags: r, cluster-analysis, ggplot2, gmm, mclust