我如何在字符串中找到最适合的可变长度霍夫曼代码？

Question

我有一个字符串，我想找到最长可变长度的字符，以生成更好的哈夫曼代码：

例如，字符串

"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"

我从中手动创建了以下霍夫曼符号到频率字符串：

In [1665]: symb2freq                                                                                                                                                                                       
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}

这让我可以用 110011111000011010110011 的霍夫曼二进制字符串来表示：

'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'

这是大小的一半，比固定长度的 symb2frequencies 更好。我的问题是，是否有正则表达式、itertools 或编程方法来创建多个长度的 symb2frequencies，其中最大长度。在我的例子中，最大长度是 11。最短长度无关紧要，但目标是尽可能获得最佳频率，以减少表示树的霍夫曼代码的大小。如果我按集合大小分组，我无法得到一个短的 Huffman 二进制字符串来表示最佳代码集。

所以我的问题是如何以编程方式使用正则表达式或 itertools 或任何其他方法进行转换：

"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"

至：

In [1665]: symb2freq                                                                                                                                                                                       
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}

这是我手动构建的。

我认为这是一个难题，因此提前感谢您提供有关如何解决此问题的任何建议。

我写这段代码是为了表明您可以使用可变长度代码将霍夫曼代码解码为所示模式，我用它来重新创建一半大小的数字 1009732533765201。我手动创建了 sym2freq 并且需要有关如何以编程方式创建它的想法的帮助，以便我可以获得比设置分组大小更好的压缩。我有一个程序可以创建一个数学模式来重新创建一个数字，我正在寻找创建可变长度代码以最好地表示那些格式为 + 和 - 's

的模式

# Incorporated a better method of rebuilding the string from RoyM
# Uncompresses the number: 1009732533765201
# which has the binary:
# 0b11100101100101100010101100111111100100010001010001
# Our compressed binary is:
# 110011111000011010110011
# We compress a variable length pattern that is decoded 
# to try to achieve nearly as good compression (maybe better in this case )and to more importantly 
# to compare compression as this is a different approach
# to that of binary compression of Huffman using XOR Patterns
# S is a representation of XOR math that is required to rebuild the original number
# so we don't compress the number itself, we compress the XOR math to rebuild the number
# instead

s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
compressed_binary='110011111000011010110011'

def decodepattern(pattern, j=1):
   x=2
   for c in range(len(pattern)):
     if pattern[c]=='+' or pattern[c]=='1':
       j^=x
     else:
       j^=-x
     x<<=1
   return abs(j)


i = 0
j = 0
uncompressed_string = ''
while(j<len(compressed_binary)):
    if compressed_binary[i:j] in s:
        uncompressed_string += s[compressed_binary[i:j]]
        i = j
    else:
        j += 1
uncompressed_string += s[compressed_binary[i:j]]
answer = decodepattern(uncompressed_string)

print("Bin length of 1009732533765201 is 50 which is the same as the pattern length")
print("Original Bin of 1009732533765201 is: ")
print(bin(1009732533765201)[2:])
print("Original Pattern:")
print(uncompressed_string)
print("Compressed to: ", bin(13600435)[2:])
print("110011111000011010110011 uncompressed to the original string pattern of : ")
print(uncompressed_string)
print("Which then i was able to recreate the original integer of: ")
print(decodepattern(uncompressed_string))

程序输出为：

Bin length of 1009732533765201 is 50 which is the same as the pattern length
Original Bin of 1009732533765201 is: 
11100101100101100010101100111111100100010001010001
Original Pattern:
b'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
Compressed to:  110011111000011010110011
110011111000011010110011 uncompressed to the original string pattern of : 
b'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
Which then i was able to recreate the original integer of: 
1009732533765201

所以你可以看到我可以使用可变长度代码来解压缩一个数字，我只是不知道如何在不手动创建可变长度代码的情况下（达到一定长度）。

预期结果是某种正则表达式，或 itertools 方法或任何其他获取字符串的方法：

"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"

并像这样创建一个 symb2freq（我手动创建的）：

In [1665]: symb2freq                                                                                                                                                                                       
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}

Answer 1

虽然我不确定问题的合理性，但您可以通过以下方式暴力破解所有可能的密码本：

#!/usr/bin/env python3

import huffman


def find_best_codebook(string, max_symbol_length, max_partitions):
    min_entropy = len(string)
    best_i = None
    best_weights = None
    best_codebook = None
    for i, tokens in enumerate(gen_partitions(string, max_partitions=max_partitions)):
        if any(len(t) > max_symbol_length for t in tokens):
            continue
        symbol_weights = compute_weights(tokens)
        codebook = huffman.codebook(symbol_weights.items())
        entropy = compute_entropy(symbol_weights, codebook)
        print(f'i={i} X={symbol_weights} H(X)={entropy}')

        if entropy < min_entropy:
            best_i = i
            best_weights = symbol_weights
            best_codebook = codebook
            min_entropy = entropy

    print(f'best i={best_i} X={best_weights} H(X)={min_entropy}')
    print('best codebook:', best_codebook)
    return best_codebook, min_entropy


def gen_partitions(string, max_partitions=None):
    if max_partitions is None:
        max_partitions = len(string)

    if max_partitions <= 0:
        yield [string]
        return

    for index in range(len(string) + 1):
        for rest in gen_partitions(string[index:], max_partitions - 1):
            yield [string[:index]] + rest


def compute_weights(tokens):
    freqs = {}
    for token in tokens:
        if not token:
            continue
        freqs[token] = freqs.get(token, 0) + 1
    return freqs


def compute_entropy(symbol_weights, codebook):
    return sum(len(code) * symbol_weights[symbol] for symbol, code in codebook.items())


if __name__ == '__main__':
    string = '++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
    max_symbol_length = 11
    max_partitions = len(string) - max_symbol_length
    find_best_codebook(string, max_symbol_length, max_partitions)

请注意，将长度为 N 的字符串划分为所有可能的分区是字符串长度的指数（N 个分割点，每个点有两个决定：分割和不分割，因此 O(2^N)）。

此外，鉴于问题的形成方式，您应该能够通过简单地将输入分成最大码字长度的连续块来获得最佳压缩。例如，使用问题中所述的最大符号长度 11，您可以执行 {'++----', '++--++--+-+', '+++++-+-+--', '---++-+---+', '-+---+-++--'} 以获得 5 个符号（最多 3 位代码字）和最多 15 位的压缩文本。

我如何在字符串中找到最适合的可变长度霍夫曼代码？

How do i find variable length Huffman codes in a string for a best fit?

python

regex

huffman-code