Old (sklearn 0.17) GMM, DPGM, VBGMM vs new (sklearn 0.18) GaussianMixture and BayesianGaussianMixture
In a previous scikit-learn version (0.17), I used the following code to automatically determine the best Gaussian mixture model and to optimize the hyperparameters (alpha, covariance type, BIC) for unsupervised clustering.
import math
import numpy as np
from sklearn import mixture

# Gaussian Mixture Model
try:
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a mixture of Gaussians with EM
        gmm = mixture.GMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type)
        gmm.fit(transformed_features)
        bic.append(gmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm
            best_covariance_type = cv_type
    gmm = best_gmm
except Exception as e:
    print('Error with GMM estimator. Error: %s' % e)
# Dirichlet Process Gaussian Mixture Model
try:
    # Determine the most suitable alpha parameter
    alpha = 2 / math.log(len(transformed_features))
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a Dirichlet-process mixture of Gaussians with variational inference
        dpgmm = mixture.DPGMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type, alpha=alpha)
        dpgmm.fit(transformed_features)
        bic.append(dpgmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_dpgmm = dpgmm
            best_covariance_type = cv_type
    dpgmm = best_dpgmm
except Exception as e:
    print('Error with DPGMM estimator. Error: %s' % e)
# Variational Inference for Gaussian Mixture Model
try:
    # Determine the most suitable alpha parameter
    alpha = 2 / math.log(len(transformed_features))
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a mixture of Gaussians with variational inference
        vbgmm = mixture.VBGMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type, alpha=alpha)
        vbgmm.fit(transformed_features)
        bic.append(vbgmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_vbgmm = vbgmm
            best_covariance_type = cv_type
    vbgmm = best_vbgmm
except Exception as e:
    print('Error with VBGMM estimator. Error: %s' % e)
How can I achieve the same or similar behavior with the new GaussianMixture / BayesianGaussianMixture models introduced in scikit-learn 0.18?
According to the scikit-learn documentation, there is no longer an "alpha" parameter; instead there is a "weight_concentration_prior" parameter. Are these the same?
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html#sklearn.mixture.BayesianGaussianMixture
weight_concentration_prior : float | None, optional. The dirichlet concentration of each component on the weight distribution (Dirichlet). The higher concentration puts more mass in the center and will lead to more components being active, while a lower concentration parameter will lead to more mass at the edge of the mixture weights simplex. The value of the parameter must be greater than 0. If it is None, it's set to 1. / n_components.
http://scikit-learn.org/0.17/modules/generated/sklearn.mixture.VBGMM.html
alpha: float, default 1 : Real number representing the concentration parameter of the dirichlet distribution. Intuitively, the higher the value of alpha the more likely the variational mixture of Gaussians model will use all components it can.
If these two parameters (alpha and weight_concentration_prior) are the same, does that mean the formula alpha = 2/math.log(len(transformed_features)) still applies as weight_concentration_prior = 2/math.log(len(transformed_features))?
You can still use the BIC score with the classical/EM implementation of GMM provided by the GaussianMixture class.
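For example, the question's BIC loop carries over almost verbatim to the new API (a minimal sketch; NUMBER_OF_CLUSTERS and transformed_features are the same placeholders as in the question):

import numpy as np
from sklearn import mixture

lowest_bic = np.infty
best_gmm = None
for cv_type in ['spherical', 'tied', 'diag', 'full']:
    # Fit a mixture of Gaussians with EM, as before, but via GaussianMixture
    gmm = mixture.GaussianMixture(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type)
    gmm.fit(transformed_features)
    bic = gmm.bic(transformed_features)
    if bic < lowest_bic:
        lowest_bic = bic
        best_gmm = gmm
        best_covariance_type = cv_type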
The BayesianGaussianMixture class can automatically adjust the number of effective components for a given value of alpha (n_components just needs to be large enough).
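A minimal sketch of that, assuming you carry the old alpha heuristic over as weight_concentration_prior (whether that formula is still appropriate is exactly what the question asks, so treat it as an assumption rather than a recommendation):

import math
from sklearn import mixture

# Assumption: reuse the question's alpha heuristic for the new parameter name
alpha = 2 / math.log(len(transformed_features))

bgmm = mixture.BayesianGaussianMixture(
    n_components=NUMBER_OF_CLUSTERS,  # upper bound; surplus components get near-zero weight
    covariance_type='full',
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=alpha)
bgmm.fit(transformed_features)

# Components with non-negligible weight are the "effective" ones
# (the 1e-2 threshold is an illustrative choice, not a library default)
n_effective = (bgmm.weights_ > 1e-2).sum()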
You can also use standard cross-validation on the log-likelihood (using the score method of the model).
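A minimal sketch of that with GridSearchCV, which works here because GaussianMixture.score returns the average per-sample log-likelihood and needs no labels:

from sklearn import mixture
from sklearn.model_selection import GridSearchCV

param_grid = {'covariance_type': ['spherical', 'tied', 'diag', 'full']}
search = GridSearchCV(mixture.GaussianMixture(n_components=NUMBER_OF_CLUSTERS),
                      param_grid, cv=5)  # selects by cross-validated log-likelihood
search.fit(transformed_features)
best_gmm = search.best_estimator_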