从 trie 中删除频率 < 5 和长度 > 15 的单词的算法

Algorithm to remove words from a trie whose frequency < 5 and length > 15

我有一个巨大的 trie 字典,它是根据网络数据构建的。虽然当我将 trie 写入文件时它只有 5MB,但当我将它加载到内存时它的大小非常大(超过 100 MB)。所以我必须压缩 trie。

我在编写递归函数(最好像 DFS 一样以线性时间运行)来删除频率 < 5 且长度 > 15 的单词时遇到困难。感谢您的帮助

这是我的 trie 结构。

class TrieNode:
    def __init__(self):
        self.ch = '|'
        self.score = 0
        self.childs = [None]*26
        self.isWord = False


class Trie:
    def __init__(self):
        self.root = TrieNode('$')

    @staticmethod
    def print_trie(node, level):
        if node is None:
            return
        print(node.ch, " ", level, " ", node.isWord)
        for i in range(26):
            Trie.print_trie(node.childs[i], level+1)

    def insert(self, word):
        word = word.lower()
        if not is_valid(word):
            return
        childs = self.root.childs
        i = 0
        while i < len(word):
            idx = to_int(word[i])
            if childs[idx] is not None:
                t = childs[idx]
            else:
                t = TrieNode(word[i])
                childs[idx] = t
            childs = t.childs

            if i == len(word)-1:
                t.isWord = True
                t.score += 1
            i += 1

    def search_node(self, word):
        word = word.lower()
        if not is_valid(word):
            return False, 0
        if self.root is None or word is None or len(word) == 0:
            return False, 0

        children = self.root.childs
        for i in range(len(word)):
            idx = to_int(word[i])
            if children[idx] is not None:
                t = children[idx]
                children = t.childs
            else:
                return False, 0
        if t.isWord:
            return True, t.score
        else:
            return False, t.score

I am not sure if removing will solve the problem. The space consumed is not because of the words but because of the 26 children every node has.

例如。我有一个单词 cat 的频率为 30,还有另一个单词 cater 的频率为 10。因此,如果删除 [=15= 的节点]t in cat 那么后面的所有节点都会被删除(也就是cater会缩减为)

因此,从 Trie 中删除一个词意味着将其分数设置为 0。

以下方法采用一个节点及其级别(最初传入根和 0)和 returns 如果节点在修剪后应保持活动状态则为 True,如果节点应从 trie 中删除则为 False(使用它的子树)。

def prune(node, level):
  if node is None:
    return False
  canPruneNode = True
  for idx in xrange(len(node.children)):
    # If any of the children remains alive, don't prune current node.
    if prune(children[idx], level + 1):
      canPruneNode = False
    else:
      # Remove dead child.
      node.children[idx] = None
  if node.isWord and level > 15 and node.score < 5:
    node.isWord = False
  # Current node should be removed if and only if all of its children
  # were removed and it doesn't represent a word itself after pruning.
  return node.isWord or not canPruneNode