用字典解码霍夫曼码

Question

我需要使用包含 ASCII 和霍夫曼位之间转换的文件来解码我用我的程序编码的霍夫曼代码。我在程序中已经有一本从 "codes" 到 ASCII 的字典，如下所示：

{'01110': '!', '01111': 'B', '10100': 'l', '10110': 'q', '10111': 'y'}

我创建了函数：

def huffmanDecode (dictionary, text) :

这需要字典和代码。我尝试在字典中搜索文本中的键并使用 replace 方法形式 string 和 sub 来自 re 但他们都没有正确解码消息。例如，如果代码是：

011111011101110

将其解码为：

应该很简单

By!

但我无法通过遍历代码并在字典中搜索匹配来做到这一点！

如何通过在文本中找到键并将其替换为值来使用字典中的键及其值解码代码？

非常感谢任何帮助。

Answer 1

不确定您尝试了什么，但 re.sub 或 replace 可能不起作用，因为它们不一定从字符串的开头开始替换。您必须查看字符串以什么代码开头，然后替换该代码，然后继续处理字符串的其余部分。

例如，像这样：

def huffmanDecode (dictionary, text):
    res = ""
    while text:
        for k in dictionary:
            if text.startswith(k):
                res += dictionary[k]
                text = text[len(k):]
    return res

或递归：

def huffmanDecode (dictionary, text):
    if text:
        k = next(k for k in dictionary if text.startswith(k))
        return dictionary[k] + huffmanDecode(dictionary, text[len(k):])
    return ""

您还可以根据您的代码制作一个正则表达式，然后使用 re.match 查找下一个：

import re
def huffmanDecode (dictionary, text):
    p = "|".join(dictionary) # join keys to regex
    res = ""
    while text:
        m = re.match(p, text)
        res += dictionary[m.group()]
        text = text[len(m.group()):]
    return res

注意：这些都没有正确的错误处理，如果代码与消息不匹配，将失败或永远循环，但这应该让你开始。

Answer 2

使用 bitarray 模块，您可以免费获得 huffman en-/de-coding，并且可能比其他任何东西都更有效：

from bitarray import bitarray

huffman_dict = {
    '!': bitarray('01110'), 'B': bitarray('01111'),
    'l': bitarray('10100'), 'q': bitarray('10110'),
    'y': bitarray('10111')
}

a = bitarray()
a.encode(huffman_dict, 'Bylly!')
print(a)

dec = bitarray('011111011101110').decode(huffman_dict)
print(dec)
print(''.join(dec))

# # output:
# bitarray('011111011110100101001011101110')
# ['B', 'y', '!']
# By!

如果您不想安装该模块，请阅读下面的部分。

这是一个使用 霍夫曼树 解码的变体 - 该程序可以运行，但可能有更好的变体来表示二叉树（我选择了一个元组）。

当您的码字长度不尽相同时，此版本可能更适合。二叉树的另一个好处是，这里的代码很明显是 prefix-free.

你在 tree-form 中的代码看起来像这样（过度缩进以使 tree-structure 可见）：

huffman_tree = \
    (   # 0
        (   # 00
            None,
            # 01
            (   # 010
                None,
                # 011
                (   # 0110
                    None,
                    # 0111
                    (   # 01110
                        '!',
                        # 01111
                        'B')))),
        # 1
        (   # 10
            (   # 100
                None,
                # 101
                (   # 1010
                    (   # 10100
                        'l',
                        # 10101
                        None
                    ),
                    # 1011
                    (   # 10110
                        'q',
                        # 10111
                        'y'))),
            # 11
            None))

使用它你可以解码：

def huffman_decode(strg):
    ret = ''
    cur_node = huffman_tree
    for char in strg:
        cur_node = cur_node[int(char)]
        if cur_node is None:
            raise ValueError
        elif isinstance(cur_node, str):
            ret += cur_node
            cur_node = huffman_tree
    return ret

print(huffman_decode('011111011101110'))

如果解码命中 None 发生了一些错误并引发了 ValueError。一旦解码命中字符串，当前节点 cur_node 将重置为 'root node' 并且游戏从树的开头开始。

只是因为我可以：这是你的（不完整的）霍夫曼树的可视化显示（这可能有助于理解算法的作用：每当你遇到 0：向右+向下；每当你遇到一个1：向右+向上）；如果您点击 end-node: return 该节点处的字符并在根节点处重新启动。

用字典解码霍夫曼码

Decoding a Huffman code with a dictionary

python

huffman-code