为什么我的霍夫曼压缩使压缩后的输出比我的原始文本大?
Why is my Huffman Compression making the Compressed output a larger size than my original text?
我正在尝试在 Python 中编写程序以使用 Huffman 压缩来压缩文本。我遇到的问题是压缩文本在保存到文本文件时最终会比原始文本大,我不确定为什么会这样。我实现了一个名为 Heapnode 的 class 来帮助我构建优先级队列并使用 Heapq 构建我的二叉树。在 class HuffmanCoding 中,我实现了获取每个字符的频率、创建优先级队列、使用它将 'nodes' 合并成一种二叉树,并遍历该树以构建霍夫曼代码的方法对于每个字符。
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other): # if the frequency of one character is lower than the frequency of another one
return self.freq < other.freq
def __eq__(self, other): # if two characters have the same frequencies
if other == None:
return False
if not isinstance(other, HeapNode): # checks if the character is a node or not
return False
return self.freq == other.freq
class HuffmanCoding:
def __init__(self, text_to_compress):
self.text_to_compress = text_to_compress # text that will be compressed
self.heap = []
self.codes = {} # will store the Huffman code of each character
self.decompress_map = {}
def get_frequency(self): # method to find frequency of each character in text - RLE
frequency_Dictionary = {} # creates an empty dictionary where frequency of each character will be stored
for character in self.text_to_compress: # Iterates through the text to be compressed
if character in frequency_Dictionary:
frequency_Dictionary[character] = frequency_Dictionary[character] + 1 # if character already exists in
# dictionary, its value is increased by 1
else:
frequency_Dictionary[character] = 1 # if character is not present in list, its value is set to 1
return frequency_Dictionary
def make_queue(self, frequency): # creates the priority queue of each character and its associated frequency
for key in frequency:
node = HeapNode(key, frequency[key]) # create node (character) and store its frequency alongside it
heapq.heappush(self.heap, node) # Push the node into the heap
def merge_nodes(
self): # creates HuffmanTree by getting the two minimum nodes and merging them together, until theres
# only one node left
while len(self.heap) > 1:
node1 = heapq.heappop(self.heap) # pop node from top of heap
node2 = heapq.heappop(self.heap) # pop next node which is now at the top of heap
merged = HeapNode(None, node1.freq + node2.freq) # merge the two nodes we popped out from heap
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged) # push merged node back into the heap
def make_codes(self, root, current_code): # Creates Huffman code for each character
if root == None:
return
if root.char != None:
self.codes[root.char] = current_code
self.decompress_map[current_code] = root.char
self.make_codes(root.left, current_code + "0") # Every time you traverse left, add a 0 - Recursive Call
self.make_codes(root.right, current_code + "1") # Every time you traverse right, add a 1 - Recursive Call
def assignCodes(self): # Assigns codes to each character
root = heapq.heappop(self.heap) # extracts root node from heap
current_code = ""
self.make_codes(root, current_code)
def get_compressed_text(self, text): # Replaces characters in original text with codes
compressed_text = ""
for character in text:
compressed_text += self.codes[character]
return compressed_text
def show_compressed_text(self):
frequency = self.get_frequency()
self.make_queue(frequency)
self.merge_nodes()
self.assignCodes()
compressed_text = self.get_compressed_text(self.text_to_compress)
return compressed_text
print(HuffmanCoding('This sentence will get compressed').show_compressed_text())
您的代码是包含 ASCII 字符“0”和“1”的字符串。每个字符占用八位,因此您将压缩数据扩展八倍。
您需要制作可变长度的 二进制 代码,其中一位占 space 的一位。然后,您需要能够连接这些可变长度代码以构成一个字节序列(以 b''
开始,而不是 ""
),并根据需要用零位填充最后一个字节。然后你有一个 字节 的序列,每个包含你的代码中的八个位。每个都可以有 256 个可能的字节值中的任何一个。
您可以在整数上使用位运算符来构造它,特别是移位:<<
、>>
),或:|
,和:&
。您可以使用 bytes()
将整数转换为字节。例如:
>>> bytes([65,66,67])
b'ABC'
另请注意,您正在压缩一个非常短的字符串,即使您将输出正确地写为位也不会被压缩。特别是一旦您将代码与其一起发送。您需要测试更多的文本,以便压缩能够利用英语中不同字母的频率。
我正在尝试在 Python 中编写程序以使用 Huffman 压缩来压缩文本。我遇到的问题是压缩文本在保存到文本文件时最终会比原始文本大,我不确定为什么会这样。我实现了一个名为 Heapnode 的 class 来帮助我构建优先级队列并使用 Heapq 构建我的二叉树。在 class HuffmanCoding 中,我实现了获取每个字符的频率、创建优先级队列、使用它将 'nodes' 合并成一种二叉树,并遍历该树以构建霍夫曼代码的方法对于每个字符。
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other): # if the frequency of one character is lower than the frequency of another one
return self.freq < other.freq
def __eq__(self, other): # if two characters have the same frequencies
if other == None:
return False
if not isinstance(other, HeapNode): # checks if the character is a node or not
return False
return self.freq == other.freq
class HuffmanCoding:
def __init__(self, text_to_compress):
self.text_to_compress = text_to_compress # text that will be compressed
self.heap = []
self.codes = {} # will store the Huffman code of each character
self.decompress_map = {}
def get_frequency(self): # method to find frequency of each character in text - RLE
frequency_Dictionary = {} # creates an empty dictionary where frequency of each character will be stored
for character in self.text_to_compress: # Iterates through the text to be compressed
if character in frequency_Dictionary:
frequency_Dictionary[character] = frequency_Dictionary[character] + 1 # if character already exists in
# dictionary, its value is increased by 1
else:
frequency_Dictionary[character] = 1 # if character is not present in list, its value is set to 1
return frequency_Dictionary
def make_queue(self, frequency): # creates the priority queue of each character and its associated frequency
for key in frequency:
node = HeapNode(key, frequency[key]) # create node (character) and store its frequency alongside it
heapq.heappush(self.heap, node) # Push the node into the heap
def merge_nodes(
self): # creates HuffmanTree by getting the two minimum nodes and merging them together, until theres
# only one node left
while len(self.heap) > 1:
node1 = heapq.heappop(self.heap) # pop node from top of heap
node2 = heapq.heappop(self.heap) # pop next node which is now at the top of heap
merged = HeapNode(None, node1.freq + node2.freq) # merge the two nodes we popped out from heap
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged) # push merged node back into the heap
def make_codes(self, root, current_code): # Creates Huffman code for each character
if root == None:
return
if root.char != None:
self.codes[root.char] = current_code
self.decompress_map[current_code] = root.char
self.make_codes(root.left, current_code + "0") # Every time you traverse left, add a 0 - Recursive Call
self.make_codes(root.right, current_code + "1") # Every time you traverse right, add a 1 - Recursive Call
def assignCodes(self): # Assigns codes to each character
root = heapq.heappop(self.heap) # extracts root node from heap
current_code = ""
self.make_codes(root, current_code)
def get_compressed_text(self, text): # Replaces characters in original text with codes
compressed_text = ""
for character in text:
compressed_text += self.codes[character]
return compressed_text
def show_compressed_text(self):
frequency = self.get_frequency()
self.make_queue(frequency)
self.merge_nodes()
self.assignCodes()
compressed_text = self.get_compressed_text(self.text_to_compress)
return compressed_text
print(HuffmanCoding('This sentence will get compressed').show_compressed_text())
您的代码是包含 ASCII 字符“0”和“1”的字符串。每个字符占用八位,因此您将压缩数据扩展八倍。
您需要制作可变长度的 二进制 代码,其中一位占 space 的一位。然后,您需要能够连接这些可变长度代码以构成一个字节序列(以 b''
开始,而不是 ""
),并根据需要用零位填充最后一个字节。然后你有一个 字节 的序列,每个包含你的代码中的八个位。每个都可以有 256 个可能的字节值中的任何一个。
您可以在整数上使用位运算符来构造它,特别是移位:<<
、>>
),或:|
,和:&
。您可以使用 bytes()
将整数转换为字节。例如:
>>> bytes([65,66,67])
b'ABC'
另请注意,您正在压缩一个非常短的字符串,即使您将输出正确地写为位也不会被压缩。特别是一旦您将代码与其一起发送。您需要测试更多的文本,以便压缩能够利用英语中不同字母的频率。