为什么我的霍夫曼压缩使压缩后的输出比我的原始文本大?

Why is my Huffman Compression making the Compressed output a larger size than my original text?

我正在尝试在 Python 中编写程序以使用 Huffman 压缩来压缩文本。我遇到的问题是压缩文本在保存到文本文件时最终会比原始文本大,我不确定为什么会这样。我实现了一个名为 Heapnode 的 class 来帮助我构建优先级队列并使用 Heapq 构建我的二叉树。在 class HuffmanCoding 中,我实现了获取每个字符的频率、创建优先级队列、使用它将 'nodes' 合并成一种二叉树,并遍历该树以构建霍夫曼代码的方法对于每个字符。



class HeapNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):  # if the frequency of one character is lower than the frequency of another one
        return self.freq < other.freq

    def __eq__(self, other):  # if two characters have the same frequencies
        if other == None:
            return False
        if not isinstance(other, HeapNode):  # checks if the character is a node or not
            return False
        return self.freq == other.freq


class HuffmanCoding:
    def __init__(self, text_to_compress):
        self.text_to_compress = text_to_compress  # text that will be compressed
        self.heap = []
        self.codes = {}  # will store the Huffman code of each character
        self.decompress_map = {}

    def get_frequency(self):  # method to find frequency of each character in text - RLE
        frequency_Dictionary = {}  # creates an empty dictionary where frequency of each character will be stored

        for character in self.text_to_compress:  # Iterates through the text to be compressed
            if character in frequency_Dictionary:
                frequency_Dictionary[character] = frequency_Dictionary[character] + 1  # if character already exists in
                # dictionary, its value is increased by 1
            else:
                frequency_Dictionary[character] = 1  # if character is not present in list, its value is set to 1

        return frequency_Dictionary

    def make_queue(self, frequency):  # creates the priority queue of each character and its associated frequency
        for key in frequency:
            node = HeapNode(key, frequency[key])  # create node (character) and store its frequency alongside it
            heapq.heappush(self.heap, node)  # Push the node into the heap

    def merge_nodes(
            self):  # creates HuffmanTree by getting the two minimum nodes and merging them together, until theres
        # only one node left
        while len(self.heap) > 1:
            node1 = heapq.heappop(self.heap)  # pop node from top of heap
            node2 = heapq.heappop(self.heap)  # pop next node which is now at the top of heap

            merged = HeapNode(None, node1.freq + node2.freq)  # merge the two nodes we popped out from heap
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)  # push merged node back into the heap

    def make_codes(self, root, current_code):  # Creates Huffman code for each character
        if root == None:
            return

        if root.char != None:
            self.codes[root.char] = current_code
            self.decompress_map[current_code] = root.char

        self.make_codes(root.left, current_code + "0")  # Every time you traverse left, add a 0 - Recursive Call
        self.make_codes(root.right, current_code + "1")  # Every time you traverse right, add a 1 - Recursive Call

    def assignCodes(self):  # Assigns codes to each character
        root = heapq.heappop(self.heap)  # extracts root node from heap
        current_code = ""
        self.make_codes(root, current_code)

    def get_compressed_text(self, text):  # Replaces characters in original text with codes
        compressed_text = ""
        for character in text:
            compressed_text += self.codes[character]
        return compressed_text

    def show_compressed_text(self):

        frequency = self.get_frequency()
        self.make_queue(frequency)
        self.merge_nodes()
        self.assignCodes()

        compressed_text = self.get_compressed_text(self.text_to_compress)
        return compressed_text


print(HuffmanCoding('This sentence will get compressed').show_compressed_text())

您的代码是包含 ASCII 字符“0”和“1”的字符串。每个字符占用八位,因此您将压缩数据扩展八倍。

您需要制作可变长度的 二进制 代码,其中一位占 space 的一位。然后,您需要能够连接这些可变长度代码以构成一个字节序列(以 b'' 开始,而不是 ""),并根据需要用零位填充最后一个字节。然后你有一个 字节 的序列,每个包含你的代码中的八个位。每个都可以有 256 个可能的字节值中的任何一个。

您可以在整数上使用位运算符来构造它,特别是移位:<<>>),或:|,和:&。您可以使用 bytes() 将整数转换为字节。例如:

>>> bytes([65,66,67])
b'ABC'

另请注意,您正在压缩一个非常短的字符串,即使您将输出正确地写为位也不会被压缩。特别是一旦您将代码与其一起发送。您需要测试更多的文本,以便压缩能够利用英语中不同字母的频率。