How to tune / choose the preference parameter of AffinityPropagation?

I have a large dictionary of pairwise similarity matrices that look like the following:

similarities['group1']:

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.09      , 0.09      , 0.        ],
       [0.        , 0.09      , 1.        , 0.94535157, 0.        ],
       [0.        , 0.09      , 0.94535157, 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])

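In short, every element of the matrix above is the probability that record_i and record_j are similar (a value between 0 and 1, both inclusive), where 1 means exactly similar and 0 means completely different.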
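
A quick sanity check of these properties can be sketched as follows. This assumes NumPy, and it assumes the matrices are symmetric with ones on the diagonal, as in the example above. The helper name `is_valid_similarity` is hypothetical and not part of any library.

```python
import numpy as np

def is_valid_similarity(matrix, tol=1e-9):
    """Hypothetical helper: check the properties described above."""
    m = np.asarray(matrix, dtype=float)
    # Must be a square 2-D matrix
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        return False
    in_range = np.all((m >= 0.0) & (m <= 1.0))               # probabilities in [0, 1]
    symmetric = np.allclose(m, m.T, atol=tol)                # sim(i, j) == sim(j, i)
    unit_diagonal = np.allclose(np.diag(m), 1.0, atol=tol)   # each record matches itself
    return bool(in_range and symmetric and unit_diagonal)

sim = np.array([
    [1.0, 0.0,  0.0,        0.0,        0.0],
    [0.0, 1.0,  0.09,       0.09,       0.0],
    [0.0, 0.09, 1.0,        0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0,        0.0],
    [0.0, 0.0,  0.0,        0.0,        1.0],
])
print(is_valid_similarity(sim))
```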

I then feed each similarity matrix into the AffinityPropagation algorithm in order to group / cluster similar records:

from sklearn.cluster import AffinityPropagation

sim = similarities['group1']

clusterer = AffinityPropagation(affinity='precomputed',
                                damping=0.5,
                                max_iter=25000,
                                convergence_iter=2500,
                                preference=????)  # ISSUE here

affinity = clusterer.fit(sim)

cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_

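However, since I run the above on multiple similarity matrices, I need a generalised preference parameter, and I can't seem to tune it.

The docs say it defaults to the median of the similarity matrix, but I get a lot of false positives with that setting. The mean sometimes works and sometimes gives too many clusters, etc.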
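
To see why the median and the mean behave so differently on a matrix like the one above, it may help to compute the candidate values directly. The sketch below assumes NumPy. Restricting the statistics to off-diagonal entries is an assumption made here for illustration; the full-matrix median is included for comparison with the default described above.

```python
import numpy as np

sim = np.array([
    [1.0, 0.0,  0.0,        0.0,        0.0],
    [0.0, 1.0,  0.09,       0.09,       0.0],
    [0.0, 0.09, 1.0,        0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0,        0.0],
    [0.0, 0.0,  0.0,        0.0,        1.0],
])

# Off-diagonal entries only: the diagonal (self-similarity) is always 1 here
off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]

candidates = {
    "min": float(off_diag.min()),
    "median": float(np.median(off_diag)),
    "mean": float(off_diag.mean()),
    "full_median": float(np.median(sim)),  # median over the whole matrix
}
for name, value in candidates.items():
    print(f"{name}: {value:.4f}")
```

For this particular matrix both medians are 0, because most pairs have zero similarity, while the mean is pulled up by the single strongly similar pair.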


For example, when playing with the preference parameter, these are the results I get from the similarity matrix:


My question is: how should I choose this preference parameter so that it generalises?

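A naive, brute-force grid search solution can be implemented along these lines: if any pair inside a cluster scores below a certain threshold (0.5, for example), we re-run the clustering with an adjusted value of the preference parameter.

A naive implementation would look like the following.


First, a function that tests whether a cluster needs tuning, with the threshold being 0.5 in this example: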

def is_tuning_required(similarity_matrix, rows_of_cluster):
    rows = similarity_matrix[rows_of_cluster]

    for row in rows:
        for col_index in rows_of_cluster:
            score = row[col_index]

            if score > 0.5:
                continue

            return True

    return False
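
The same check can also be written without explicit loops. The following is a minimal sketch assuming NumPy; the name `is_tuning_required_vectorised` and the `threshold` parameter are hypothetical additions, and the function is intended to return the same result as the loop above.

```python
import numpy as np

def is_tuning_required_vectorised(similarity_matrix, rows_of_cluster, threshold=0.5):
    """Return True if any pair inside the cluster scores at or below the threshold."""
    idx = np.asarray(rows_of_cluster, dtype=int)
    # Sub-matrix holding only the pairwise scores between members of the cluster
    block = np.asarray(similarity_matrix)[np.ix_(idx, idx)]
    return bool(np.any(block <= threshold))

sim = np.array([
    [1.0, 0.0,  0.0,        0.0,        0.0],
    [0.0, 1.0,  0.09,       0.09,       0.0],
    [0.0, 0.09, 1.0,        0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0,        0.0],
    [0.0, 0.0,  0.0,        0.0,        1.0],
])

print(is_tuning_required_vectorised(sim, [2, 3]))     # False: records 2 and 3 are strongly similar
print(is_tuning_required_vectorised(sim, [1, 2, 3]))  # True: record 1 scores only 0.09 against the others
```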

Next, build a range of preference values against which the clustering will be run:

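import numpy as np

def get_pref_range(similarity):
    # Start from the median; fall back to the mean when the median is 0
    starting_point = np.median(similarity)

    if starting_point == 0:
        starting_point = np.mean(similarity)

    # The multiplicative steps below only make progress for positive values
    if starting_point <= 0:
        return [starting_point]

    # Let's try to accelerate the pace of values picking.
    # Assumption: a step of 1.25 for larger starting points; a step of 1 would
    # never change the value and the loops below would not terminate.
    step = 1.25 if starting_point >= 0.05 else 2

    preference_tuning_range = [starting_point]
    max_val = starting_point
    while max_val < 1:
        max_val *= 1.25 if max_val > 0.1 and step == 2 else step
        # Record every intermediate value so each one gets tried (assumed intent)
        preference_tuning_range.append(max_val)

    min_val = starting_point
    if starting_point >= 0.05:
        while min_val > 0.01:
            min_val /= step
            preference_tuning_range.append(min_val)

    return preference_tuning_range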

A normal AffinityPropagation run, with the preference parameter passed in:

from sklearn.cluster import AffinityPropagation

def run_clustering(similarity, preference):
    clusterer = AffinityPropagation(damping=0.9,
                                    affinity='precomputed',
                                    max_iter=5000,
                                    convergence_iter=2500,
                                    verbose=False,
                                    preference=preference)

    affinity = clusterer.fit(similarity)

    labels = affinity.labels_

    return labels, len(set(labels)), affinity.cluster_centers_indices_
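
To get a feel for how the preference influences the result, one can sweep a few values on the example matrix from the question. This is only a sketch: the preference values are picked arbitrarily for illustration, it assumes a scikit-learn version whose `AffinityPropagation` accepts `random_state`, and the cluster counts it prints may differ between versions, so none are claimed here.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

sim = np.array([
    [1.0, 0.0,  0.0,        0.0,        0.0],
    [0.0, 1.0,  0.09,       0.09,       0.0],
    [0.0, 0.09, 1.0,        0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0,        0.0],
    [0.0, 0.0,  0.0,        0.0,        1.0],
])

results = {}
# Preference values chosen purely for illustration
for preference in (0.0, 0.1, 0.5, 0.9):
    model = AffinityPropagation(affinity="precomputed",
                                damping=0.9,
                                preference=preference,
                                random_state=0).fit(sim)
    results[preference] = model.labels_
    n_clusters = len(model.cluster_centers_indices_)
    print(f"preference={preference}: {n_clusters} clusters, labels={model.labels_}")
```

In general, larger preference values make points more likely to be chosen as exemplars, which tends to produce more clusters.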

The method we would actually call takes a similarity (1 - distance) matrix as its argument:

def run_ideal_clustering(similarity):
    preference_tuning_range = get_pref_range(similarity)

    best_tested_preference = None
    for preference in preference_tuning_range:
        labels, labels_count, cluster_centers_indices = run_clustering(similarity, preference)

        needs_tuning = False
        wrong_clusters = 0
        for label_index in range(labels_count):
            cluster_elements_indexes = np.where(labels == label_index)[0]

            tuning_required = is_tuning_required(similarity, cluster_elements_indexes)
            if tuning_required:
                wrong_clusters += 1

                if not needs_tuning:
                    needs_tuning = True

        if best_tested_preference is None or wrong_clusters < best_tested_preference[1]:
            best_tested_preference = (preference, wrong_clusters)

        if not needs_tuning:
            return labels, labels_count, cluster_centers_indices

    # The clustering could not be fully tuned within the tested range, so fall back to the preference with the fewest wrong clusters
    return run_clustering(similarity, best_tested_preference[0])
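
Since the function expects a similarity matrix (1 - distance), a distance matrix would need converting first. The snippet below is a minimal sketch assuming NumPy; the distance values are hypothetical, and rescaling by the maximum is just one possible way to bring unbounded distances into [0, 1].

```python
import numpy as np

# Hypothetical pairwise distances, assumed to be already scaled to [0, 1]
distance = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
])

# 1 means identical, 0 means completely different
similarity = 1.0 - distance

# If the distances are not in [0, 1], one option is to rescale by the maximum first
raw = np.array([
    [0.0, 3.0],
    [3.0, 0.0],
])
scaled_similarity = 1.0 - raw / raw.max()

print(similarity)
print(scaled_similarity)
```

The resulting `similarity` could then be passed to `run_ideal_clustering`.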

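Obviously, this is a brute-force solution, and it will not perform well on large datasets / similarity matrices.

If a simpler and better solution gets posted, I'll accept it.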