Clustering a categorical data-set with a distance-based approach

I want to compare the ROCK clustering algorithm with a distance-based algorithm. Suppose we have (m) training examples and (n) features.

ROCK:

From what I understand, this is what ROCK does:

1. It calculates an (m*m) similarity matrix using Jaccard coefficients.
2. Then a threshold value is provided by the user.
3. Based on the threshold value, it then links the data points; data points that have more neighbors in common are said to be in the same cluster.
   For example, see the png file below.

   The picture above shows the similarity matrix; let threshold_value = 0.2.
4. The algorithm then computes links between the points, which come out as below:

   for A - A   (only A exceeds the threshold value)
       B - BCD (because bb, bc and bd exceed the threshold value)
       C - BCD
       D - BCD

   Now, since B, C and D each have 3 neighbors in common, they are grouped into the same cluster.

Therefore we get two clusters: {A} and {B, C, D} (a small sketch of these steps follows below).
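To make steps 1-4 concrete, here is a minimal Python sketch on four made-up categorical records standing in for A, B, C and D. The attribute values and the final grouping rule (grouping points whose thresholded neighbor sets coincide) are simplifications for illustration only; the actual ROCK algorithm merges clusters hierarchically using a goodness measure over the link counts.

    import numpy as np

    def jaccard(x, y):
        # Jaccard coefficient between two sets of categorical attribute values
        x, y = set(x), set(y)
        return len(x & y) / len(x | y)

    # Made-up categorical records standing in for the points A, B, C, D above
    records = {
        "A": {"red", "small", "oval"},
        "B": {"blue", "large", "round"},
        "C": {"blue", "large", "square"},
        "D": {"blue", "medium", "square"},
    }
    names = list(records)

    # 1. The m*m similarity matrix of Jaccard coefficients
    sim = np.array([[jaccard(records[a], records[b]) for b in names] for a in names])

    # 2.-3. Threshold the matrix to get each point's neighbor set
    threshold = 0.2
    neighbors = {a: {b for j, b in enumerate(names) if sim[i, j] >= threshold}
                 for i, a in enumerate(names)}

    # 4. Points sharing many common neighbors end up together. Here we simply
    #    group points whose neighbor sets coincide, which reproduces the
    #    {A}, {B, C, D} split; real ROCK instead merges clusters hierarchically
    #    using a goodness measure over the link counts.
    clusters = {}
    for a, nbrs in neighbors.items():
        clusters.setdefault(frozenset(nbrs), []).append(a)
    print(list(clusters.values()))    # [['A'], ['B', 'C', 'D']]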

Distance-based approach:

1. I take a different approach, but like ROCK I also create the similarity matrix.
2. I also compute the initial links, e.g.
   for A - A   (only A exceeds the threshold value)
       B - BCD (because bb, bc and bd exceed the threshold value)
       C - BCD
       D - BCD
3. Now, instead of finding neighbors, I perform some mojo and find the best centroids.
4. After finding the centroids, I run the k-means clustering algorithm over the (m*m) similarity matrix.
5. Since I find the centroids beforehand, the running time drops because k-means does not have to be restarted multiple times with randomly chosen centroids (see the sketch after this list).
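Continuing from the ROCK sketch above (it reuses sim, names and neighbors), here is roughly what the centroid-seeded k-means step could look like. Since the "mojo" that picks the best centroids is not spelled out, the snippet just takes one representative row of the similarity matrix per distinct neighbor set as a placeholder; scikit-learn's KMeans with init= set to those rows and n_init=1 then runs a single pass with no random restarts.

    import numpy as np
    from sklearn.cluster import KMeans

    # Reuses sim, names and neighbors from the ROCK sketch above.
    # Placeholder for the "mojo": one representative row per distinct neighbor set.
    seen, centroid_rows = set(), []
    for i, a in enumerate(names):
        key = frozenset(neighbors[a])
        if key not in seen:
            seen.add(key)
            centroid_rows.append(i)

    init_centroids = sim[centroid_rows]            # k rows of the m*m matrix
    km = KMeans(n_clusters=len(centroid_rows),     # k comes from the link structure
                init=init_centroids,               # fixed centroids, so ...
                n_init=1).fit(sim)                 # ... a single k-means run suffices
    print(km.labels_)                              # e.g. [0 1 1 1] for A, B, C, D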

Problem statement:

The problem I see is the space complexity: the similarity matrix is an (m*m) matrix, and if m is very large, say 1 million, storing such a large matrix becomes difficult; likewise, because of the matrix size, the Euclidean distance computations take time.

However, I believe that in ROCK there is no need to store the matrix at all, since the links can be built on the fly while the Jaccard coefficients between the data points are being computed.
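For what it's worth, a sketch of what I mean by building the links on the fly: only the sparse neighbor lists are kept, and each Jaccard coefficient is discarded right after it has been compared against the threshold. The pairwise loop is of course still O(m*m) in time; only the space goes down.

    from collections import defaultdict

    def jaccard(x, y):
        x, y = set(x), set(y)
        return len(x & y) / len(x | y)

    def links_on_the_fly(records, threshold):
        # records: dict mapping point name -> set of categorical attribute values.
        # Only the sparse neighbor lists are kept; every Jaccard coefficient is
        # compared against the threshold and then thrown away, so the m*m matrix
        # is never materialised (time is still O(m*m) pairs).
        names = list(records)
        neighbors = defaultdict(set)
        for i, a in enumerate(names):
            neighbors[a].add(a)                   # every point neighbors itself
            for b in names[i + 1:]:
                if jaccard(records[a], records[b]) >= threshold:
                    neighbors[a].add(b)
                    neighbors[b].add(a)
        return neighbors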

I ran the distance-based approach on the similarity matrix of the mushroom data set available at (uci.org), and the output was very similar to ROCK, and for some other data sets even better.

Questions:

1. Is my understanding of ROCK correct?
2. Is it even worth considering creating such a large similarity matrix, storing it on disk and using it later to calculate distances?
3. I would really appreciate it if someone could provide the big-O complexity for the distance-based approach.

Thanks :)

From what I know, clustering becomes very memory-intensive as the size grows, and you will have to come up with a way to reduce the dimensionality of your data.

I'm not familiar with ROCK, but I have worked on clustering problems before, where I had to cluster millions of documents.

Distance calculation metric : Levenshtein distance
Clustering algorithm : DBSCAN
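For reference, a toy version of that setup: a plain dynamic-programming Levenshtein distance plus scikit-learn's DBSCAN on a precomputed distance matrix. The word list and the eps/min_samples values are made up for illustration, and precomputing the full matrix is exactly what stops scaling to millions of documents, hence the blocking advice below.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance, kept dependency-free.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    # Toy word list -- at millions of documents this full matrix is exactly
    # what stops scaling, hence the blocking advice further down.
    words = ["cluster", "clusters", "clustering", "distance", "instance"]
    D = np.array([[levenshtein(a, b) for b in words] for a in words])

    labels = DBSCAN(eps=3, min_samples=2, metric="precomputed").fit_predict(D)
    print(labels)    # e.g. [0 0 0 1 1]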

Coming back to your questions:

Question 2:
Is it even worth considering creating such a large similarity matrix, storing it on disk and using it later to calculate distances?

I would never recommend building such a large matrix. For example, a distance matrix for 1 million words would need about 4 TB of space. You will have to use some kind of blocking technique to group roughly similar documents first, and then apply a clustering algorithm on top of that.
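A quick sanity check on that figure, assuming 4 bytes per matrix entry (64-bit doubles would double it):

    m = 1_000_000              # number of items
    bytes_per_entry = 4        # assuming 32-bit floats; float64 would double this
    print(m * m * bytes_per_entry / 1e12, "TB")   # -> 4.0 TB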

Question 3:
I would really appreciate it if someone could provide the big-O complexity for the distance-based approach.

Usually the time to calculate the distance between two words is negligible, since words are not very long. Your complexity comes from the number of comparisons, so if you have n words, the time complexity is O(n*n).
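Put differently, the individual distance calls are cheap; what grows is the number of pairs, n*(n-1)/2. A quick illustration:

    # Pairwise comparisons needed for n items: n*(n-1)/2, i.e. O(n^2).
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,d}  ->  {n * (n - 1) // 2:,} comparisons")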