创建一个非贪婪的 LZW 算法

Question

基本上，我正在做 IB 计算机科学扩展论文，并且正在考虑使用 LZW 算法的非贪婪实现。我找到了以下链接：

并且一直在假设paper 1中描述的算法和paper 2中的LZW-FP本质上是一样的。无论哪种方式，跟踪论文 1 中的伪代码都是一种痛苦的经历，没有任何结果，用我老师的话来说 "is incredibly difficult to understand." 如果有人能弄清楚如何跟踪它，或者碰巧之前研究过算法并且知道它是如何工作的，那将是一个很大的帮助。

Answer 1

注意：我把你所说的"paper 1"称为Horspool 1995 and "paper 2" as Matias et al 1998。我只看过 Horspool 1995 中的 LZW 算法，所以如果你指的是 LZSS 算法，这对你帮助不大。

我的理解是Horspool的算法是Matias et al 1998的作者所说的"LZW-FPA"，与他们所说的"LZW-FP"不同；不同之处在于算法决定将哪些子字符串添加到字典中的方式。由于 "LZW-FP" 将与 LZW 添加的完全相同的子字符串添加到字典中，因此 LZW-FP 无法为任何字符串生成更长的压缩序列。 LZW-FPA（和 Horspool 的算法）在每个输出周期添加贪婪匹配的后继字符串。这不是同一个子字符串（因为贪婪匹配不会像在 LZW 中那样从同一点开始），因此理论上它可能会产生比 LZW 更长的压缩序列。

Horspool 的算法实际上非常简单，但它的缺点在于所提供的伪代码中存在几个愚蠢的错误。实施该算法是检测和修复这些错误的好方法；我在下面放了伪代码的注释版本。

类似 LZW 的算法将输入分解为一系列块。压缩器维护可用块的字典（具有关联的代码字）。最初，字典包含所有单字符字符串。然后它遍历输入，在每个点找到其字典中该点的最长前缀。找到该块后，它会输出其代码字，并将附加了下一个输入字符的块添加到字典中。（因为找到的块是字典中最长的前缀，所以块加上下一个字符不能在字典中。）然后它在块上前进，并在下一个输入点（就在最后一个字符之前）继续阻止它刚刚添加到字典中）。

Horspool 的修改也是在每个点找到最长的前缀，并将该前缀扩展一个字符添加到字典中。但它不会立即输出该块。相反，它考虑贪婪匹配的前缀，并针对每个前缀计算出下一个贪婪匹配是什么。这给了它两个块的候选范围；它选择具有最佳进展的范围。为了避免在此搜索中耗费太多时间，该算法由它将测试的前缀数量参数化，假设短得多的前缀不太可能产生更长的范围。（Horspool 为这种启发式方法提供了一些证据，尽管您可能想通过自己的实验来验证这一点。）

在 Horspool 的伪代码中，α 就是我所说的 "candidate match" —— 即在上一步找到的贪婪匹配 —— 而 β_j 是在 α 的第 j^th 个前缀之后的输入点的贪婪后继匹配。（从最后算起，所以β₀恰好是α的贪心后继匹配，结果K为0就会产生LZW算法。我想Horspool在某处提到过这个事实。 ) L 就是 α 的长度。该算法最终将使用 α 的某个前缀，可能（通常）全部使用它。

这是图 2 中 Horspool 的伪代码以及我的注释：

initialize dictionary D with all strings of length 1;
set α = the string in D that matches the first
        symbol of the input;
set L = length(α);
while more than L symbols of input remain do
begin
    // The new string α++head(β0) must be added to D here, rather
    // than where Horspool adds it. Otherwise, it is not available for the
    // search for a successor match. Of course, head(β0) is not meaningful here
    // because β0 doesn't exist yet, but it's just the symbol following α in
    // the input.
    for j := 0 to max(L-1,K) do
        // The above should be min(L - 1, K), not max.
        // (Otherwise, K would be almost irrelevant.)
        find βj, the longest string in D that matches
            the input starting L-j symbols ahead;
    add the new string α++head(β0) to D;
    // See above; the new string must be added before the search
    set j = value of j in range 0 to max(L-1,K)
            such that L - j + length(βj) is a maximum;
    // Again, min rather than max
    output the index in D of the string prefix(α,j);
    // Here Horspool forgets that j is the number of characters removed
    // from the end of α, not the number of characters in the desired prefix.
    // So j should be replaced with L - j
    advance j symbols through the input;
    // Again, the advance should be L - j, not j
    set α = βj;
    set L = length(α);
end;
output the index in D of string α;

创建一个非贪婪的 LZW 算法

Creating a Non-greedy LZW algorithm

compression

algorithm

computer-science

data-compression