How to tune / choose the preference parameter of AffinityPropagation?
我有 "pairwise similarity matrixes" 的大字典,如下所示:
similarities['group1']:
array([[1. , 0. , 0. , 0. , 0. ],
[0. , 1. , 0.09 , 0.09 , 0. ],
[0. , 0.09 , 1. , 0.94535157, 0. ],
[0. , 0.09 , 0.94535157, 1. , 0. ],
[0. , 0. , 0. , 0. , 1. ]])
In short, each element of the matrix above is the probability that record_i and record_j are similar (a value between 0 and 1, both included), where 1 means completely similar and 0 means completely different.
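For context, here is a minimal sketch of how such a dictionary of precomputed matrices could be assembled; the pairwise_similarity function and the groups dict are hypothetical placeholders, not my actual pipeline:

import numpy as np

def build_similarity_dict(groups, pairwise_similarity):
    # groups: dict mapping a group name to a list of records
    # pairwise_similarity: any symmetric function returning a score in [0, 1]
    similarities = {}
    for name, records in groups.items():
        n = len(records)
        sim = np.eye(n)  # a record is always fully similar to itself
        for i in range(n):
            for j in range(i + 1, n):
                sim[i, j] = sim[j, i] = pairwise_similarity(records[i], records[j])
        similarities[name] = sim
    return similarities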
I then feed each similarity matrix into the AffinityPropagation algorithm in order to group/cluster similar records:
import numpy as np
from sklearn.cluster import AffinityPropagation

sim = similarities['group1']
clusterer = AffinityPropagation(affinity='precomputed',
                                damping=0.5,
                                max_iter=25000,
                                convergence_iter=2500,
                                preference=????)  # ISSUE here
affinity = clusterer.fit(sim)
cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_
However, since I run the above on multiple similarity matrices, I need a generalized preference parameter, and I can't seem to tune it.
The documentation says it is set by default to the median of the similarity matrix, but with that setting I get a lot of false positives; the mean sometimes works and sometimes gives too many clusters, etc.
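For reference, this is roughly how I set the preference explicitly to the median or the mean of a precomputed matrix when comparing the two (just a sketch of what I tried):

import numpy as np
from sklearn.cluster import AffinityPropagation

sim = similarities['group1']

pref_median = np.median(sim)  # what the default behaviour uses
pref_mean = np.mean(sim)      # the alternative I also tried

clusterer = AffinityPropagation(affinity='precomputed',
                                damping=0.5,
                                max_iter=25000,
                                convergence_iter=2500,
                                preference=pref_median)  # or pref_mean
labels = clusterer.fit(sim).labels_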
For example, these are the results I get from one similarity matrix depending on the preference parameter:

preference = default # which is the median (value 0.2) of the similarity matrix
(incorrect result: we can see that record 18 should not be there, because its similarity to the other records is too low):
# Indexes of the elements in Cluster n°5: [15, 18, 22, 27]
{'15_18': 0.08,
'15_22': 0.964546229533378,
'15_27': 0.6909703138051403,
'18_22': 0.12, # Not Ok, the similarity is too low
'18_27': 0.19, # Not Ok, the similarity is too low
'22_27': 0.6909703138051403}
preference = 0.2 (in fact anything from 0.11 to 0.26)
(correct result: the records grouped together are similar):
# Indexes of the elements in Cluster n°5: [15, 22, 27]
{'15_22': 0.964546229533378,
'15_27': 0.6909703138051403,
'22_27': 0.6909703138051403}
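For completeness, the pairwise score dictionaries shown above come from a small helper along these lines (the cluster indexes refer to a larger matrix than the 5x5 example; the helper name is mine):

def cluster_pair_scores(similarity_matrix, cluster_indexes):
    # Report the similarity of every pair of records inside one cluster,
    # keyed as 'i_j' like in the outputs above
    scores = {}
    for pos, i in enumerate(cluster_indexes):
        for j in cluster_indexes[pos + 1:]:
            scores['%s_%s' % (i, j)] = similarity_matrix[i, j]
    return scores

# e.g. cluster_pair_scores(sim, [15, 18, 22, 27])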
My question is: how should I choose this preference parameter so that it generalizes?
A naive, brute-force grid search solution could be implemented so that, whenever a connection score within a cluster falls below a given threshold (e.g. 0.5), we rerun the clustering with an adjusted value of the preference parameter.
A naive implementation could look like the following. First, a function that tests whether a clustering needs tuning; the threshold in this example is 0.5:
def is_tuning_required(similarity_matrix, rows_of_cluster):
    # Check every pairwise score inside the cluster; if any score is at or
    # below the 0.5 threshold, the clustering needs to be tuned again
    rows = similarity_matrix[rows_of_cluster]

    for row in rows:
        for col_index in rows_of_cluster:
            score = row[col_index]
            if score > 0.5:
                continue
            return True

    return False
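As a quick check, on the problematic cluster n°5 shown earlier this returns True, because the '18_22' and '18_27' scores are below the 0.5 threshold (assuming sim is the full matrix those indexes refer to):

needs_tuning = is_tuning_required(sim, [15, 18, 22, 27])  # True for the cluster above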
Then, build the range of preference values over which the clustering will be run:
def get_pref_range(similarity):
    starting_point = np.median(similarity)

    # Fall back to the mean when the median is 0 (mostly dissimilar records)
    if starting_point == 0:
        starting_point = np.mean(similarity)

    # Let's try to accelerate the pace of values picking
    step = 1.25 if starting_point >= 0.05 else 2

    # Candidate values above the starting point, up to 1
    preference_tuning_range = [starting_point]
    max_val = starting_point
    while max_val < 1:
        max_val *= 1.25 if max_val > 0.1 and step == 2 else step
        preference_tuning_range.append(max_val)

    # Candidate values below the starting point, down to 0.01
    min_val = starting_point
    if starting_point >= 0.05:
        while min_val > 0.01:
            min_val /= step
            preference_tuning_range.append(min_val)

    return preference_tuning_range
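To see which candidate preferences the search will try for a given matrix, the range can be inspected directly (the exact values depend on the matrix):

candidate_preferences = get_pref_range(similarities['group1'])
print(sorted(candidate_preferences))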
A plain AffinityPropagation run, with a preference parameter passed in:
def run_clustering(similarity, preference):
    clusterer = AffinityPropagation(damping=0.9,
                                    affinity='precomputed',
                                    max_iter=5000,
                                    convergence_iter=2500,
                                    verbose=False,
                                    preference=preference)
    affinity = clusterer.fit(similarity)
    labels = affinity.labels_
    return labels, len(set(labels)), affinity.cluster_centers_indices_
The method we actually call, taking a similarity (1 - distance) matrix as its argument:
def run_ideal_clustering(similarity):
    preference_tuning_range = get_pref_range(similarity)

    best_tested_preference = None
    for preference in preference_tuning_range:
        labels, labels_count, cluster_centers_indices = run_clustering(similarity, preference)

        needs_tuning = False
        wrong_clusters = 0
        for label_index in range(labels_count):
            cluster_elements_indexes = np.where(labels == label_index)[0]
            tuning_required = is_tuning_required(similarity, cluster_elements_indexes)
            if tuning_required:
                wrong_clusters += 1
                needs_tuning = True

        # Keep track of the preference that produced the fewest "wrong" clusters
        if best_tested_preference is None or wrong_clusters < best_tested_preference[1]:
            best_tested_preference = (preference, wrong_clusters)

        if not needs_tuning:
            return labels, labels_count, cluster_centers_indices

    # The clustering has not been tuned enough during the iterations,
    # so we fall back to the preference with the fewest wrong clusters
    return run_clustering(similarity, best_tested_preference[0])
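Putting it all together, this is roughly how I drive the search over the whole dictionary of similarity matrices (sketch only; the results dict is just for illustration):

results = {}
for group_name, sim in similarities.items():
    labels, n_clusters, centers = run_ideal_clustering(sim)
    results[group_name] = {'labels': labels,
                           'n_clusters': n_clusters,
                           'centers': centers}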
Obviously this is a brute-force solution which will not perform well on large datasets / similarity matrices. If a simpler and better solution is posted, I will accept it.
我有 "pairwise similarity matrixes" 的大字典,如下所示:
similarity['group1']
:
array([[1. , 0. , 0. , 0. , 0. ],
[0. , 1. , 0.09 , 0.09 , 0. ],
[0. , 0.09 , 1. , 0.94535157, 0. ],
[0. , 0.09 , 0.94535157, 1. , 0. ],
[0. , 0. , 0. , 0. , 1. ]])
简而言之,前一个矩阵的每个元素都是record_i
和record_j
相似的概率(值为0和1,包括0和1),1
完全相似并且0
完全不同。
然后我将每个相似度矩阵输入 AffinityPropagation
算法以对相似记录进行分组/聚类:
sim = similarities['group1']
clusterer = AffinityPropagation(affinity='precomputed',
damping=0.5,
max_iter=25000,
convergence_iter=2500,
preference=????)) # ISSUE here
affinity = clusterer.fit(sim)
cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_
但是,由于我 运行 上面的多个相似性矩阵,我需要一个通用的 preference
参数,我似乎无法调整它。
它在文档中说它默认设置为相似矩阵的中值,但是我用这个设置得到了很多误报,平均值有时工作有时会给出太多的集群等...
例如:在使用偏好参数时,这些是我从相似度矩阵中得到的结果
preference = default # which is the median (value 0.2) of the similarity matrix
: (不正确的结果,我们看到记录18
不应该存在,因为与其他记录很低):# Indexes of the elements in Cluster n°5: [15, 18, 22, 27] {'15_18': 0.08, '15_22': 0.964546229533378, '15_27': 0.6909703138051403, '18_22': 0.12, # Not Ok, the similarity is too low '18_27': 0.19, # Not Ok, the similarity is too low '22_27': 0.6909703138051403}
preference = 0.2 in fact from 0.11 to 0.26
:(正确 结果与记录相似):# Indexes of the elements in Cluster n°5: [15, 22, 27] {'15_22': 0.964546229533378, '15_27': 0.6909703138051403, '22_27': 0.6909703138051403}
我的问题是:我应该如何选择这个 preference
参数来概括?
天真和蛮力 grid search
解决方案可以这样实现,如果连接得分低于某个阈值(例如 0.5),我们会使用 preference
参数的调整值重新 运行 聚类。
一个天真的实现就像下面这样。
首先,一个测试聚类是否需要调整的函数,本例中的阈值为0.5
:
def is_tuning_required(similarity_matrix, rows_of_cluster):
rows = similarity_matrix[rows_of_cluster]
for row in rows:
for col_index in rows_of_cluster:
score = row[col_index]
if score > 0.5:
continue
return True
return False
构建一个偏好值范围,根据该值进行聚类 运行:
def get_pref_range(similarity):
starting_point = np.median(similarity)
if starting_point == 0:
starting_point = np.mean(similarity)
# Let's try to accelerate the pace of values picking
step = 1 if starting_point >= 0.05 else step = 2
preference_tuning_range = [starting_point]
max_val = starting_point
while max_val < 1:
max_val *= 1.25 if max_val > 0.1 and step == 2 else step
preference_tuning_range.append(max_val)
min_val = starting_point
if starting_point >= 0.05:
while min_val > 0.01:
min_val /= step
preference_tuning_range.append(min_val)
return preference_tuning_range
一个正常的AfinityPropagation
,传递了一个preference
参数:
def run_clustering(similarity, preference):
clusterer = AffinityPropagation(damping=0.9,
affinity='precomputed',
max_iter=5000,
convergence_iter=2500,
verbose=False,
preference=preference)
affinity = clusterer.fit(similarity)
labels = affinity.labels_
return labels, len(set(labels)), affinity.cluster_centers_indices_
我们实际调用的方法是将相似性(1 - 距离)矩阵作为参数:
def run_ideal_clustering(similarity):
preference_tuning_range = get_pref_range(similarity)
best_tested_preference = None
for preference in preference_tuning_range:
labels, labels_count, cluster_centers_indices = run_clustering(similarity, preference)
needs_tuning = False
wrong_clusters = 0
for label_index in range(labels_count):
cluster_elements_indexes = np.where(labels == label_index)[0]
tuning_required = is_tuning_required(similarity, cluster_elements_indexes)
if tuning_required:
wrong_clusters += 1
if not needs_tuning:
needs_tuning = True
if best_tested_preference is None or wrong_clusters < best_tested_preference[1]:
best_tested_preference = (preference, wrong_clusters)
if not needs_tuning:
return labels, labels_count, cluster_centers_indices
# The clustering has not been tuned enough during the iterations, we choose the less wrong clusters
return run_clustering(similarity, preference)
显然,这是一种蛮力解决方案,在大型数据集/相似性矩阵中性能不佳。
如果发布更简单更好的解决方案,我会接受。