为什么关于大索引类型的函数 huffmandeco 出现八度错误？

Question

我有一点 MatLab 脚本，我试图理解它。它的作用不大。它仅从文件中读取文本并使用霍夫曼函数对其进行编码和解码。但它在解码时抛出错误：

“错误：内存不足或维度对于 Octave 的索引类型来说太大
错误：从第 95 行第 19 列的 huffmandeco>dict2tree 调用

不知道为什么，我调试了一下，没有看到大索引类型

我添加了根据输入文本计算 p 的部分。

%text is a random input text file in ASCII

%calculate the relative frequency of every Symbol
for i=0:127
    nlet=length(find(text==i));
    p(i+1)=nlet/length(text);
end
symb = 0:127;
dict = huffmandict(symb,p); % Create dictionary
compdata = huffmanenco(fdata,dict); % Encode the data
dsig = huffmandeco(compdata,dict); % Decode the Huffman code

我只能使用 Octave 而不是 MatLab。我不知道，如果有意外错误。我在 Win10 上使用 Octave Version 6.2.0。我试过大数据的版本，它没有改变任何东西。
也许有人知道这方面的错误？

编辑：我再次调试了代码。在函数 huffmandeco 中，我发现了以下函数：

function tree = dict2tree (dict)

  L = length (dict);
  lengths = zeros (1, L);

  ## the depth of the tree is limited by the maximum word length.
  for i = 1:L
    lengths(i) = length (dict{i});
  endfor
  m = max (lengths);

  tree = zeros (1, 2^(m+1)-1)-1;

  for i = 1:L
    pointer = 1;
    word    = dict{i};
    for bit = word
      pointer = 2 * pointer + bit;
    endfor
    tree(pointer) = i;
  endfor

endfunction

本例最大长度m为82，所以函数计算：
树 = 零点 (1, 2^(82+1)-1)-1.
所以很明显为什么错误称为索引类型过大。
但是一定有解决办法或者其他错误，因为之前测试过代码。

Answer 1

我还没有仔细检查代码以了解原因，但是 huffmandict 并没有像它声称的那样忽略零概率符号。我也没有找到关于 Savannah 的错误报告，但我又没有彻底搜索。

解决方法是将符号列表及其概率限制为仅实际出现的符号。使用 containers.Map 是理想的，但在 Octave 中，您可以使用 unique:

的几个输出来做到这一点

% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.';   % just make it easier to read

对于字符串

textstr = 'Random String Input.';

结果是：

>> symbols
symbols =  .IRSadgimnoprtu
>> inds
inds =
 Columns 1 through 19:
    4    6   11    7   12   10    1    5   15   14    9   11    8    1    3   11   13   16   15
 Column 20:
    2

因此输入字符串中的第一个符号是symbols(4)，第二个是symbols(6)，依此类推。

从那里，您只需使用 symbols 和 inds 创建字典和 encode/decode 信号。这是一个快速演示脚本：

textstr = 'Random String Input.';
fprintf("Starting string: %s\n", textstr);

% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.';   % just make it easier to read

% Calculate the frequency of each symbol in table
% max(inds) == numel(symbols)
p = histc(inds, 1:max(inds))/numel(inds);

dict = huffmandict(symbols, p);
compdata = huffmanenco(inds, dict);
dsig = huffmandeco(compdata, dict);

fprintf("Decoded string: %s\n", symbols(dsig));

并且输出：

Starting string: Random String Input.
Decoded string: Random String Input.

要对原始输入字符串以外的字符串进行编码，您必须将字符映射到符号索引（显然，确保字符串中的所有符号实际上都存在于符号 table 中）：

>> [~, s_idx] = ismember('trogdor', symbols)
s_idx =
   15   14   12    8    7   12   14

>> compdata = huffmanenco(s_idx, dict);
>> dsig = huffmandeco(compdata, dict);
>> fprintf("Decoded string: %s\n", symbols(dsig));
Decoded string: trogdor

为什么关于大索引类型的函数 huffmandeco 出现八度错误？

Why octave error with function huffmandeco about large index types?

octave

huffman-code