LZ77优化

LZ77 optimization

我有LZ77压缩算法的代码。它适用于小文件。但是如果我想压缩 100 kB 和更大的文件,它会花费很多时间。

我想,都是因为这部分:

do { // searchin for longest mathcing
                j++;
                i++;
                if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);

                if (j == strlen(searchBuffer) && lookBuffer[addJ] == lookBuffer[i]) {
                    do {
                        i++;
                        j++;
                        if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);
                        addJ++;
                    } while (lookBuffer[addJ] == lookBuffer[i] && i < lookLen);
                }
            } while (searchBuffer[j] == lookBuffer[i] && i < lookLen);

这部分搜索搜索缓冲区和先行缓冲区之间的最长匹配。

那么,有没有更快的方法找到这个匹配项?

LZ77 压缩的完整代码:

 void lz77(char *fileName, FILE *from, FILE *path, long long size) {

char *searchBuffer = (char*)malloc((searchLen) * sizeof(char)); // search array
memset(searchBuffer, 0x00, searchLen);

long i = 0, j = 0, length = 0, offset = 0, isMatching = 0, iForSearchBuff = 0, iForLookBuff = 0;
int isFirst = 1, n = 0;
unsigned char symbol;

while(!feof(from)) {
    isMatching = 0;
    length = 0;
    iForSearchBuff = 0;
    iForLookBuff = 0;
    offset = strlen(searchBuffer);
    if (isFirst) { // first symbol
        fread(searchBuffer, 1, 1, from); 
        writeBit(0, 0, *searchBuffer, tmpFile); // writing(0, 0, symbol)
        isFirst = 0;
    } else {
        char *lookBuffer = (char*)malloc((lookLen) * sizeof(char)); // look array
        memset(lookBuffer, 0x00, lookLen);

        fread(lookBuffer, 1, 1, from);

        for (j = 0; j <= strlen(searchBuffer); j++) {

            i = 0;
            if (searchBuffer[j] == lookBuffer[i]) { // MATCHING
                isMatching = 1;
                long fixJ = j, fixI = i, addJ = 0;

                do { // searchin for longest mathcing
                    j++;
                    i++;
                    if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);

                    if (j == strlen(searchBuffer) && lookBuffer[addJ] == lookBuffer[i]) {
                        do {
                            i++;
                            j++;
                            if(!lookBuffer[i]) fread(&lookBuffer[i], 1, 1, from);
                            addJ++;
                        } while (lookBuffer[addJ] == lookBuffer[i] && i < lookLen);
                    }
                } while (searchBuffer[j] == lookBuffer[i] && i < lookLen);

                if (((strlen(searchBuffer) - fixJ) <= offset) && ((i - fixI) >= length)) { 
                    length = i - fixI;
                    offset = strlen(searchBuffer) - fixJ;
                    iForSearchBuff = fixI;
                    iForLookBuff = i;
                }

                j = fixJ;

            } else if (j >= (strlen(searchBuffer))) {
                if (isMatching) {
                    if (lookBuffer[iForLookBuff] == '[=11=]') { writeBit(offset, length, ' ', tmpFile);}
                    else {
                        writeBit(offset, length, lookBuffer[iForLookBuff], tmpFile);
                        searchBuffer = insertIntoBuffer(searchBuffer, length + 1, &lookBuffer[iForSearchBuff]);
                    } 

                    isMatching = 0;
                    break;
                } else {
                    writeBit(0, 0, lookBuffer[i], tmpFile);

                    searchBuffer = insertIntoBuffer(searchBuffer, 1, &lookBuffer[i]);
                    break;
                }
            }

        }
    }
}

}

我不是专家,但聪明人已经设计出可能适用于此的高效搜索算法。 例如,查看 Knuth–Morris–Pratt 算法 https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

保持 table 由 window 中每个字符串的第一个 'n' 字节索引——其中 'n' 是您想要的最小长度匹配(可能是 2 ).您可能有多个字符串以相同的 'n' 字节开头,因此此 table 中的条目是字符串列表以及每个字符串在 window 中的偏移量。您可能希望在距 window 末尾的最小偏移量处进行最长匹​​配。值得专门处理以同一字节的长 运行 开头的字符串。当您向前移动 window 时,您需要从 table 中删除字符串并添加新字符串。

zlib 的 deflate 使用每个位置的前三个字节作为散列函数的输入来保留散列 table。对于散列冲突,为每个散列值维护一个列表,其中列表的长度取决于压缩级别。哈希值是为当前要压缩的位置的前三个字节计算的。对于每一个哈希值相同且在允许距离内的前面位置,然后直接比较该位置的数据以找到匹配的长度。然后发出找到的最长匹配项,如果没有三个或更多字节的匹配项,则发出文字。