Grouping / clustering a list of numbers so that the min-max gap of each subset is always less than a cutoff in Python

Question

Suppose I have a list of 50 random numbers. I want to group them so that the min-max gap of each subset is less than a cutoff of 0.05. Here is my code.

import random

def cluster(data, cutoff):
    data.sort()
    res = []
    old_x = -10.
    for x in data:
        if abs(x - old_x) > cutoff:
            res.append([x])
        else:
            res[-1].append(x)
        old_x = x
    return res

cutoff = 0.05
data = [random.random() for _ in range(50)]
res = cluster(data, cutoff)

Check whether the min-max gap of every subset is less than the cutoff:

print(all([(max(s) - min(s)) < cutoff for s in res]))

Output:

False

Clearly my code doesn't work. Any suggestions?

Answer 1

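You are only checking whether the next element is within the cutoff of the previous element in sorted order (that is what old_x holds), not of the smallest element in its cluster. So, for example, the input [20, 20.03, 20.06] produces a single cluster: each consecutive gap is 0.03, but the overall min-max gap is 0.06, which exceeds the cutoff of 0.05.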
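The chaining effect can be reproduced directly on that three-element input. The sketch below copies the function from the question under the hypothetical name cluster_chained (the name is not from the original post) and assumes the same cutoff of 0.05.

```python
def cluster_chained(data, cutoff):
    """The question's function, unchanged apart from its (hypothetical) name."""
    data.sort()
    res = []
    old_x = -10.
    for x in data:
        if abs(x - old_x) > cutoff:
            res.append([x])
        else:
            res[-1].append(x)
        old_x = x  # updated on every element, so gaps are measured neighbor to neighbor
    return res

clusters = cluster_chained([20, 20.03, 20.06], 0.05)
print(clusters)                                    # [[20, 20.03, 20.06]]
print(max(clusters[0]) - min(clusters[0]) < 0.05)  # False
```

Each neighbor is within 0.05 of the previous one, so nothing ever starts a new cluster, even though the first and last elements are about 0.06 apart.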

Fix this by updating old_x only when you start a new cluster.

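General debugging tip: always try to reproduce your problem on a small instance. A good way is to start from a failing test input and repeatedly remove elements until it passes. Then you know there is something special about that element.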
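One way to automate that shrinking step is sketched below. It is a slight variant of the tip: rather than stopping at the first removal that makes the test pass, it keeps every removal that still fails, so what remains is a small input in which every element matters. The names cluster_chained, fails and shrink are hypothetical helpers, and the sample input is made up for illustration.

```python
def cluster_chained(data, cutoff):
    """The question's original (buggy) grouping function."""
    data.sort()
    res = []
    old_x = -10.
    for x in data:
        if abs(x - old_x) > cutoff:
            res.append([x])
        else:
            res[-1].append(x)
        old_x = x
    return res

def fails(data, cutoff):
    """True if any cluster violates the min-max gap requirement."""
    clusters = cluster_chained(list(data), cutoff)  # copy: the function sorts in place
    return any(max(s) - min(s) >= cutoff for s in clusters)

def shrink(data, cutoff):
    """Drop elements one at a time for as long as the input still fails."""
    data = list(data)
    changed = True
    while changed:
        changed = False
        for i in range(len(data)):
            candidate = data[:i] + data[i + 1:]
            if fails(candidate, cutoff):
                data = candidate  # still failing, so this element was not needed
                changed = True
                break
    return data

small = shrink([0.10, 0.13, 0.16, 0.50, 0.90], 0.05)
print(small)  # [0.1, 0.13, 0.16]
```

Removing any one of the three remaining numbers makes the check pass, which points straight at the chain of small consecutive gaps as the cause.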

Answer 2

Following @j_random_hacker's answer, I simply changed my code to

def cluster(data, cutoff):
    data.sort()
    res = []
    old_x = -10.
    for x in data:
        if abs(x - old_x) > cutoff:
            res.append([x])
            old_x = x
        else:
            res[-1].append(x)
    return res

Now everything works:

>>> print(all([(max(s) - min(s)) < cutoff for s in res]))
True
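A few edge cases may still be worth guarding against. The sentinel old_x = -10. assumes no input lies within the cutoff of -10 (otherwise res[-1] is reached while res is still empty), data.sort() reorders the caller's list, and a value exactly cutoff away from the cluster's first element stays in the cluster even though the check uses a strict <. The following is a minimal sketch of one possible variant, under the hypothetical name cluster_anchored; it is not part of the original answers.

```python
import random

def cluster_anchored(data, cutoff):
    """Group sorted values so each group spans strictly less than `cutoff`."""
    res = []
    for x in sorted(data):  # sorted() leaves the caller's list untouched
        # res[-1][0] is the smallest element of the current cluster
        if res and x - res[-1][0] < cutoff:
            res[-1].append(x)
        else:
            res.append([x])  # first element overall, or too far from the cluster minimum
    return res

print(cluster_anchored([20, 20.03, 20.06], 0.05))  # [[20, 20.03], [20.06]]
print(cluster_anchored([-10.0, -9.98], 0.05))      # [[-10.0, -9.98]]

cutoff = 0.05
data = [random.random() for _ in range(50)]
res = cluster_anchored(data, cutoff)
print(all(max(s) - min(s) < cutoff for s in res))  # True
```

Comparing against the cluster's first element with the same strict < used in the check keeps the grouping rule and the verification in agreement.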
