基于阈值的聚类

clustering based on thresholds

编辑(简体)

我很确定,我错过了 "google" 这个问题的正确术语,如果之前有人问过,请指点一下: 我有一个树结构,可以说如下

(0)->(0,0:7)
     (0,1:9)
(1)->(1,0:6)
     (1,1:2)
     (1,2:1)

为简单起见,让我们将其转换为平面结构

l1, l2, v1
0, 0, 7
0, 1, 9
1, 0, 6
1, 1, 2
1, 2, 1

现在让我们在这棵树上设置 3 的阈值。这意味着我们要保留高于阈值的节点,并合并分支中低于阈值的所有节点。

所以我们最终得到的是最后两行(因为它们低于阈值)以注释 'yielded' 作为两个的总和结束:

l1, l2, v1
0, 0, 7
0, 1, 9
1, 0, 6
1, (1,2),  3

理想情况下,python 中的解决方案将不胜感激。显然会有我很乐意处理的边缘条件。请注意,实际上我最终可以得到 6 深的树。

所以我终于按照我之前暗示过的繁琐方式完成了。 我开始向我的节点定义添加一些标志(dodelete=Falsevisited=False)。

并将 add_node 方法更新为

def add_child(self, node):
    node.parent = self
    node.level = self.level + 1
    self.children.append(node)
    return node

其中 self.children 是节点列表

然后两种方法

def collapse_nodes(tree, thresh=3):
    for n in tree:
        if n.dodelete:
            continue
        sub_tree = tree.get_by_path (n.id)
        sub_tree_stack = []
        for child in sub_tree.children:
           if child.val is not None and thresh > child.val:
               sub_tree_stack.append(child)
               tree.get_by_path(child.id).dodelete = True
           if sub_tree_stack:
              sub_tree.add_child(Node(",".join([subnode.name for subnode in sub_tree_stack]), 
                                 val = sum([subnode.val for subnode in sub_tree_stack]), 
                                 id= sorted([subnode.id for subnode in sub_tree_stack])[0]))
    return tree

def roll_up(tree, thresh = 2, level=5):
    for n in tree:
        if n.dodelete or n.visited:
            continue
        if n.level != level:
            continue
        sub_tree = tree.get_by_path(n.id)
        if sub_tree is None:
            continue
        sub_tree_stack = []
        for child in tree.get_by_path(n.id).children:
            if child.val is not None and child.val <= thresh:
                sub_tree_stack.append((child.name, child.id, child.val))
                # Also mark this for deletion
                tree.get_by_path(child.id).dodelete = True
        if sub_tree_stack:
            # Get the parent for these nodes
            node_name = n.name + ": [" + ",".join([subnode[0] for subnode in sub_tree_stack]) + "]"
            node_val = sum([subnode[2]for subnode in sub_tree_stack])
            node_id = sorted([subnode[1] for subnode in sub_tree_stack])[0]
            parent_name = n.parent.name
            parent_level = n.parent.level
            parent_id= n.parent.id
            # Now ensure that you delete the old node before adding new
            tree.get_by_path(n.id).dodelete = True

            tree.get_by_path(parent_id).add_child(Node(node_name, val=node_val, id = n.id, visited=True) )
    return tree

这样做有点复杂,但是有效。我已经为任何试图测试它的任性灵魂创建了一个要点 https://gist.github.com/fahaddaniyal/0dc86c80f266fd9f8cdb