Text data clustering with Python

I am currently trying to cluster a list of sequences based on their similarity, using Python.

For example:

DFKLKSLFD

DLFKFKDLD

LDPELDKSL
...

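I preprocess the data by computing pairwise distances with the Levenshtein distance. After calculating all pairwise distances and building the distance matrix, I want to use that matrix as the input to a clustering algorithm.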
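For concreteness, the preprocessing step described above could look roughly like the sketch below. It uses a plain-Python edit distance as a stand-in for a Levenshtein library, and the names `levenshtein`, `pairwise_distances` and `DISTANCES` are illustrative choices rather than anything from an existing package.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def pairwise_distances(items):
    """Symmetric matrix (list of lists) of edit distances between all items."""
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(items[i], items[j])
    return matrix


data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
DISTANCES = pairwise_distances(data)
```

Only the upper triangle is computed and then mirrored, since the edit distance is symmetric and zero on the diagonal.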

I have already tried Affinity Propagation, but its convergence is somewhat unpredictable, and I would like to get around this problem.

Does anyone have suggestions for other clustering algorithms that would suit this case?

Thanks!!

sklearn actually does show this example using DBSCAN, just like Luke once answered here.

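This is based on that example, using !pip install python-Levenshtein. However, if you have already precomputed all the distances, you can change the custom metric as shown further down.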

from Levenshtein import distance

import numpy as np
from sklearn.cluster import dbscan

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)

dbscan(X, metric=lev_metric, eps=5, min_samples=2)

If you have precomputed the distances, you can define pre_lev_metric(x, y) along the lines of
def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return DISTANCES[i,j]
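As an alternative to wrapping the lookup in a callable, scikit-learn's DBSCAN also accepts metric="precomputed", in which case the square distance matrix itself is passed to fit. A minimal sketch, assuming a small toy matrix (the values below are made up for illustration and are not the Levenshtein distances of the sequences in the question):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy symmetric distance matrix; illustrative values only.
DISTANCES = np.array([
    [0, 1, 10],
    [1, 0, 10],
    [10, 10, 0],
], dtype=float)

# With metric="precomputed", X is interpreted as pairwise distances,
# so no index trick or custom metric function is needed.
model = DBSCAN(eps=2, min_samples=2, metric="precomputed")
labels = model.fit_predict(DISTANCES)
print(labels)
```

With these toy values the first two points form one cluster and the third is labeled -1 (noise), because it has no neighbor within eps.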

An alternative answer based on K-Medoids, using sklearn_extra.cluster.KMedoids. K-Medoids is not that well known yet, but it also needs nothing more than distances.

I had to install it like this

!pip uninstall -y enum34
!pip install scikit-learn-extra

Then I was able to create clusters with:

from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)

The labels/centers are in

kmedoids.labels_
kmedoids.cluster_centers_
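To illustrate why K-Medoids needs only distances, here is a minimal pure-Python sketch of a simple alternating assign/update scheme on a distance matrix. It is not the sklearn_extra implementation; the helper names `assign` and `k_medoids` and the toy matrix are assumptions made for this example.

```python
import random


def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100, seed=0):
    """Alternating k-medoids on a square distance matrix (list of lists)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(dist)), k)  # random initial medoids
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid if empty
                continue
            # New medoid: the member with the smallest total distance
            # to all other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break  # no medoid changed, so the assignment is stable
        medoids = new_medoids
    return assign(dist, medoids), medoids


# Toy distance matrix with two obvious groups; illustrative values only.
TOY = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
labels, medoids = k_medoids(TOY, k=2)
print(labels, medoids)
```

Because every cluster center is an actual data point, the algorithm never has to average sequences, which is what makes it usable with an edit distance. On the toy matrix the first two and the last two points end up in different clusters.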

Try this.

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
    
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Result:

 - *LDPELDKSL:* LDPELDKSL
 - *DFKLKSLFD:* DFKLKSLFD
 - *XYZ:* ABC, XYZ
 - *DLFKFKDLD:* DLFKFKDLD
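
# NOTE: `kmeans` is not defined anywhere in this snippet; it presumably refers to
# a fitted sklearn.cluster.KMeans model, with `words` indexing its features.
# For each centroid, list the 10 features with the largest weights.
common_words = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))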