How can I improve the time complexity of Affinity Propagation?

Question

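I am trying to cluster similar patterns in a list using Affinity Propagation. self_pat is a list containing 80K patterns that need to be clustered. I am using the following code: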

import numpy as np
from sklearn.cluster import AffinityPropagation

self_pat = np.asarray(self_pat)  # So that indexing with a list will work
# Similarity is the negated distance, computed for every ordered pair of patterns
lev_similarity = -1 * np.array([[calculate_levenshtein_distance(w1, w2) for w1 in self_pat] for w2 in self_pat])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

for cluster_id in np.unique(affprop.labels_):
    # Exemplar (representative pattern) chosen for this cluster
    exemplar = self_pat[affprop.cluster_centers_indices_[cluster_id]]
    # All distinct patterns assigned to this cluster
    cluster = np.unique(self_pat[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

The calculate_levenshtein_distance function is as follows:

from difflib import ndiff

def calculate_levenshtein_distance(str_1, str_2):
    """
        The Levenshtein distance is a string metric for measuring the difference between two sequences.
        It is calculated as the minimum number of single-character edits necessary to transform one string into another
    """
    distance = 0
    buffer_removed = buffer_added = 0
    for x in ndiff(str_1, str_2):
        code = x[0]
        # Code ? is ignored as it does not translate to any modification
        if code == ' ':
            distance += max(buffer_removed, buffer_added)
            buffer_removed = buffer_added = 0
        elif code == '-':
            buffer_removed += 1
        elif code == '+':
            buffer_added += 1
    distance += max(buffer_removed, buffer_added)
    return distance

The program above uses 3 loops, so clustering takes a long time. Is there any way to reduce the complexity of the program?

Answer 1

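For smaller datasets, the time to completion is usually fine; for very large datasets, the time it takes to finish a job is basically unbearable. As you have discovered, clustering does not scale well. Perhaps you can take a random sample from your full dataset.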

# Fraction of rows
# here you get 25% of the rows
df.sample(frac = 0.25)
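
To put rough numbers on this: a full 80,000 x 80,000 similarity matrix has 6.4 billion entries, which is about 51 GB at 8 bytes per entry. A 25% sample (20,000 patterns) needs 400 million entries, roughly 3.2 GB. The sketch below shows one way to sample directly from a plain list and to call the distance function only once per unordered pair. The names sample_patterns and pairwise_similarity are hypothetical helpers, not part of scikit-learn or pandas, and the second one assumes the distance is symmetric. That holds for true Levenshtein distance but may not hold exactly for the ndiff-based function in the question, so it is worth checking first.

```python
import random


def sample_patterns(patterns, frac, seed=0):
    """Return a random subset holding roughly `frac` of the patterns.

    Hypothetical helper: plays the role of df.sample(frac=...) for a plain list.
    """
    k = max(1, int(len(patterns) * frac))
    return random.Random(seed).sample(list(patterns), k)


def pairwise_similarity(patterns, distance):
    """Build the negated-distance matrix with one distance call per unordered pair.

    Assumption: distance(a, b) == distance(b, a) and distance(a, a) == 0.
    """
    n = len(patterns)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(patterns[i], patterns[j])
            # Mirror the value across the diagonal instead of recomputing it
            sim[i][j] = sim[j][i] = -d
    return sim
```

With the question's code this could be used as subset = sample_patterns(self_pat, 0.25) followed by lev_similarity = np.asarray(pairwise_similarity(subset, calculate_levenshtein_distance)), and then fitted with AffinityPropagation(affinity="precomputed") as before. This roughly halves the number of distance calls but the cost still grows quadratically with the sample size, so the sample fraction remains the main lever.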

python

cluster-analysis

pattern-matching

affinity

python-3.x