Old (sklearn 0.17) GMM, DPGM, VBGMM vs new (sklearn 0.18) GaussianMixture and BayesianGaussianMixture
In a previous scikit-learn version (0.17), I used the following code to automatically determine the best Gaussian mixture model and to optimize the hyperparameters (alpha, covariance type, BIC) for unsupervised clustering.
import math
import numpy as np
from sklearn import mixture

# Gaussian Mixture Model
try:
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a mixture of Gaussians with EM
        gmm = mixture.GMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type)
        gmm.fit(transformed_features)
        bic.append(gmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm
            best_covariance_type = cv_type
    gmm = best_gmm
except Exception as e:
    print('Error with GMM estimator. Error: %s' % e)
# Dirichlet Process Gaussian Mixture Model
try:
    # Determine the most suitable alpha parameter
    alpha = 2 / math.log(len(transformed_features))
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a Dirichlet-process mixture of Gaussians with variational inference
        dpgmm = mixture.DPGMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type, alpha=alpha)
        dpgmm.fit(transformed_features)
        bic.append(dpgmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_dpgmm = dpgmm
            best_covariance_type = cv_type
    dpgmm = best_dpgmm
except Exception as e:
    print('Error with DPGMM estimator. Error: %s' % e)
# Variational Inference for Gaussian Mixture Model
try:
    # Determine the most suitable alpha parameter
    alpha = 2 / math.log(len(transformed_features))
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a mixture of Gaussians with variational inference
        vbgmm = mixture.VBGMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type, alpha=alpha)
        vbgmm.fit(transformed_features)
        bic.append(vbgmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_vbgmm = vbgmm
            best_covariance_type = cv_type
    vbgmm = best_vbgmm
except Exception as e:
    print('Error with VBGMM estimator. Error: %s' % e)
How can I achieve the same or similar behavior with the new GaussianMixture / BayesianGaussianMixture models introduced in scikit-learn 0.18?
According to the scikit-learn documentation, there is no longer an "alpha" parameter; instead there is a "weight_concentration_prior" parameter. Are these the same?
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html#sklearn.mixture.BayesianGaussianMixture
weight_concentration_prior : float | None, optional. The dirichlet concentration of each component on the weight distribution (Dirichlet). The higher concentration puts more mass in the center and will lead to more components being active, while a lower concentration parameter will lead to more mass at the edge of the mixture weights simplex. The value of the parameter must be greater than 0. If it is None, it's set to 1. / n_components.
http://scikit-learn.org/0.17/modules/generated/sklearn.mixture.VBGMM.html
alpha: float, default 1 : Real number representing the concentration parameter of the dirichlet distribution. Intuitively, the higher the value of alpha the more likely the variational mixture of Gaussians model will use all components it can.
If these two parameters (alpha and weight_concentration_prior) are the same, does that mean the formula alpha = 2/math.log(len(transformed_features)) still applies as weight_concentration_prior = 2/math.log(len(transformed_features))?
You can still use the BIC score with the classical/EM implementation of GMM provided by the GaussianMixture class.
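For example, the question's BIC loop carries over almost verbatim to the new API (a minimal sketch; NUMBER_OF_CLUSTERS and transformed_features are the same placeholders as in the question):

import numpy as np
from sklearn import mixture

lowest_bic = np.infty
best_gmm = None
for cv_type in ['spherical', 'tied', 'diag', 'full']:
    # Fit a mixture of Gaussians with EM, as before, but via GaussianMixture
    gmm = mixture.GaussianMixture(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type)
    gmm.fit(transformed_features)
    bic = gmm.bic(transformed_features)
    if bic < lowest_bic:
        lowest_bic = bic
        best_gmm = gmm
        best_covariance_type = cv_type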
The BayesianGaussianMixture class can automatically adjust the number of effective components for a given value of alpha (n_components just needs to be large enough).
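A minimal sketch of that, assuming you carry the old alpha heuristic over as weight_concentration_prior (whether that formula is still appropriate is exactly what the question asks, so treat it as an assumption rather than a recommendation):

import math
from sklearn import mixture

# Assumption: reuse the question's alpha heuristic for the new parameter name
alpha = 2 / math.log(len(transformed_features))

bgmm = mixture.BayesianGaussianMixture(
    n_components=NUMBER_OF_CLUSTERS,  # upper bound; surplus components get near-zero weight
    covariance_type='full',
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=alpha)
bgmm.fit(transformed_features)

# Components with non-negligible weight are the "effective" ones
# (the 1e-2 threshold is an illustrative choice, not a library default)
n_effective = (bgmm.weights_ > 1e-2).sum()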
You can also use standard cross-validation on the log-likelihood (using the score method of the model).
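A minimal sketch of that with GridSearchCV, which works here because GaussianMixture.score returns the average per-sample log-likelihood and needs no labels:

from sklearn import mixture
from sklearn.model_selection import GridSearchCV

param_grid = {'covariance_type': ['spherical', 'tied', 'diag', 'full']}
search = GridSearchCV(mixture.GaussianMixture(n_components=NUMBER_OF_CLUSTERS),
                      param_grid, cv=5)  # selects by cross-validated log-likelihood
search.fit(transformed_features)
best_gmm = search.best_estimator_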