DBSCAN 簇大小小于 MinPts

DBSCAN Clustersize smaller than MinPts

我刚刚想到了DBSCAN 的一些特例。案例如图here。 假设 eps 等于圆的半径。对于 MinPts=3 p 和 r 是核心点。不清楚 q 是属于 p 的簇还是属于 r 的簇。如果使用递归实现并且算法首先检查 r,则 q 将是 r 的集群的一部分。因此 p 将定义一个只有两个元素的集群。原文 paper 声明:"Note that a cluster wrt. Eps and MinPts contains at least MinPts points [...]" 我是不是遗漏了什么,或者只是没有考虑到这种特殊情况?

例如,q也是一个核心点:圆圈内有三个点:p,q,r。在这个例子中你需要 minPts=4。

您需要将密度聚类的理论定义与高效算法输出区分开来,后者仅"almost"给出理论结果是有充分理由的:在理论模型,q 将是两个集群的一部分。但这对 用户.

来说既不方便又令人惊讶

你不是第一个注意到的。甚至维基百科都知道这一点:

While minPts intuitively is the minimum cluster size, in some cases DBSCAN can produce smaller clusters.[5] A DBSCAN cluster consists of at least one core point.[5] As other points may be border points to more than one cluster, there is no guarantee that at least minPts points are included in every cluster.

参考[5]是文章

Schubert, Erich; Sander, Jörg; Ester, Martin; Kriegel, Hans Peter; Xu, Xiaowei (July 2017). "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN". ACM Trans. Database Syst. 42 (3): 19:1–19:21. doi:10.1145/3068335. ISSN 0362-5915.

其中包含脚注:

Note that this can, in rare cases, lead to a cluster with fewer than minPts points, if too many border points are reachable by different clusters, and have previously been assigned to other clusters. Every cluster will at least have one core point. Multi-assignment to exactly represent the theoretical model—or an assignment by shortest distance—can be implemented easily.