How to tell zlib compression in Python NOT to use certain bytes/characters

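For my research, I am developing a tool that sends arbitrary data through radio communication devices connected over a serial link. PySerial is used for the communication.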

If our payload is, for example, DATA, the command looks like this:

cmd = b'\x02' + DATA.encode() + b'\x03'

DATA can be large and the link is slow, so I tried compressing it with zlib.

from zlib import compress, decompress
DATA_comp = compress(DATA.encode())
cmd = b'\x02' + DATA_comp + b'\x03'

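BUT compression may introduce the bytes b'\x02' and b'\x03' somewhere in the payload. This causes errors, because the device firmware treats them as control bytes!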

Is there a way to tell zlib (or any other compression method) not to use certain bytes in the compressed output?

tl;dr: compression introduces control bytes into the payload, and the device cannot handle them

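We can split the problem into two parts:

  1. Compress the data.
  2. Transform the compressed data so that it does not contain byte 3 (and, for this protocol, byte 2).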

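For the second part, there are many encodings you can use. For example, base64 encoding never emits byte 3 (or byte 2). Going further, you could use a base255 encoding with the valid symbols 0-2 and 4-255, or base254 with the symbols 0-1 and 4-255 if byte 2 must be excluded as well.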
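A base254 encoding like the one suggested above could be sketched as follows. This is only an illustration, not a standard codec: the names ALPHABET, encode254 and decode254 are made up here, and because the question treats both 0x02 and 0x03 as control bytes, the sketch excludes both, leaving 254 symbols. It converts the whole payload through one big integer, which is simple but takes quadratic time, so it only suits modest payload sizes.

```python
# Hypothetical base254 codec: every byte value except 0x02 and 0x03 is a digit.
ALPHABET = bytes(b for b in range(256) if b not in (2, 3))
DIGIT = {b: i for i, b in enumerate(ALPHABET)}


def encode254(data: bytes) -> bytes:
    # A leading 0x01 sentinel byte preserves leading zero bytes of the input.
    n = int.from_bytes(b"\x01" + data, "big")
    out = bytearray()
    while n:
        n, r = divmod(n, 254)
        out.append(ALPHABET[r])
    return bytes(out[::-1])  # most significant digit first


def decode254(enc: bytes) -> bytes:
    n = 0
    for b in enc:
        n = n * 254 + DIGIT[b]  # KeyError if a forbidden byte slipped in
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the sentinel byte
```

The overhead is roughly one extra byte per 700 bytes of payload plus the sentinel, far less than the 33% of base64.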

With the help of @JohnZwinck, I came up with the following (presented as a minimal working example):

from zlib import compress, decompress
from base64 import b64encode
DATA_comp = compress(DATA.encode())
DATA_enc = b64encode(DATA_comp)
cmd = b'\x02' + DATA_enc + b'\x03'

The receiving end does the reverse.
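For completeness, the receiving side might look like the sketch below. The helper names build_frame and parse_frame are hypothetical, and the sketch assumes the complete frame, including both delimiters, has already been read from the serial port.

```python
from base64 import b64decode, b64encode
from zlib import compress, decompress

STX, ETX = b"\x02", b"\x03"  # the control bytes from the question


def build_frame(text: str) -> bytes:
    # Sender: compress, base64-encode, then wrap in the control bytes.
    return STX + b64encode(compress(text.encode())) + ETX


def parse_frame(frame: bytes) -> str:
    # Receiver: strip the control bytes, base64-decode, then decompress.
    if not (frame.startswith(STX) and frame.endswith(ETX)):
        raise ValueError("frame is missing the STX/ETX delimiters")
    return decompress(b64decode(frame[1:-1])).decode()
```

Because the base64 alphabet consists only of printable ASCII characters, the bytes between the delimiters can never be mistaken for a control byte.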

As @Błotosmętek pointed out, the encoding increases the payload size again by a constant factor (about 33% for base64). Using Ascii85 might be better.
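A quick way to compare the two is the sketch below, which uses a85encode and a85decode from Python's base64 module. The sample text is made up for illustration. Ascii85 emits 5 printable characters per 4 input bytes (about 25% overhead), whereas base64 emits 4 per 3.

```python
from base64 import a85decode, a85encode, b64encode
from zlib import compress, decompress

# Made-up sample payload, for illustration only.
DATA = "temperature=21.5;humidity=40;status=OK\n" * 50
DATA_comp = compress(DATA.encode())

enc64 = b64encode(DATA_comp)
enc85 = a85encode(DATA_comp)
print(len(DATA_comp), len(enc64), len(enc85))

# Ascii85 output is printable ASCII, so it is safe between the control bytes.
cmd = b"\x02" + enc85 + b"\x03"

# Receiving side: strip the control bytes, decode, decompress.
restored = decompress(a85decode(cmd[1:-1])).decode()
```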

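As already noted, let zlib do its thing, and then encode the resulting bit stream to avoid the forbidden bytes. This can be done efficiently and quickly by Huffman decoding the bit stream, using an equiprobable code, into an alphabet of as many symbols as needed, fewer than 256. On the other end, Huffman encoding that stream of symbols turns it back into the original bit stream.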

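To avoid a small number of byte values, pull 7 bits from the stream. Depending on the value of those 7 bits, pull one more bit or not. Map the resulting 7-bit or 8-bit value to the permitted subset of byte values. Repeat. Treat the input as having zero bits appended to its end, so that all of the input bits get used. To restore, reverse the process and discard the fewer than 8 zero bits left over at the end.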

Example code follows:

/*
  avoid.c version 1.0, 2 July 2017

  Copyright (C) 2017 Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
*/

// Take arbitrary binary input and encode it to avoid specified byte values.
// The number of such values to avoid is the parameter "cut". The input is
// taken as a stream of bits. At each step either 7 or 8 bits of input is coded
// to an output byte. As a result, the input bits are expanded by a factor
// between 1 and about 1.143 (rounded up to the next multiple of 8 bits),
// depending on the value of cut and depending on the input data. cut must be
// in the range 1..128. For random input, the average expansion ratio is
// 1/(1-cut/1024).
//
// avoid() does the encoding, and restore() does the decoding. avoid() uses the
// table map[], which maps the values 0..255-cut to the allowed byte values,
// i.e. the byte values that are not cut. invert_map() is provided to invert
// that transformation to make the table unmap[], which is used by restore().

#include <stddef.h>

// Encode input[0..len-1] into a subset of permitted byte values, which number
// cut less than 256. Therefore cut values are cut from the set of possible
// output byte values. map[0..255-cut] is the set of allowed byte values. cut
// must be in the range 1..128. If cut is out of range, zero is returned and no
// encoding is performed. Otherwise the return value is the size of the encoded
// result. size is the size of the output space in bytes, which should be at
// least the maximum possible encoded size, equal to ceiling(len * 8 / 7). The
// return value may be larger than size, in which case only size bytes are
// written to *output, with the remaining encoded data lost. Otherwise the
// number of bytes written to *output is the returned value.
size_t avoid(unsigned char *output, size_t size,
             unsigned char const *input, size_t len,
             unsigned char const *map, unsigned cut) {
    if (len == 0 || cut < 1 || cut > 128)
        return 0;
    unsigned buf = *input, code = buf;
    int bits = 8;
    size_t in = 1, out = 0;
    for (;;) {
        unsigned less = code >> 1;
        if (less < cut) {
            code = less;
            bits -= 7;
        }
        else {
            code -= cut;
            bits -= 8;
        }
        if (out < size)
            output[out] = map[code];
        out++;
        if (in == len && bits <= 0)
            return out;
        if (in < len) {
            if (bits < 8) {
                buf = (buf << 8) + input[in++];
                bits += 8;
            }
            code = buf >> (bits - 8);
        }
        else
            code = buf << (8 - bits);   // pad with zeros
        code &= 0xff;
    }
}

// Invert the map used by avoid() for use by restore().
void invert_map(unsigned char *unmap, unsigned char const *map, unsigned cut) {
    if (cut < 1 || cut > 128)
        return;
    unsigned k = 0;
    do {
        unmap[k++] = 255;
    } while (k < 256);
    k -= cut;
    do {
        k--;
        unmap[map[k]] = k;
    } while (k);
}

// Restore the data input[0..len-1] that was encoded with avoid(), writing the
// restored bytes to *output. The number of restored bytes is returned. size is
// the size of the output space in bytes, which should be at least the maximum
// possible restored size, equal to len. If the returned value is greater than
// size, then only size bytes are written to *output, with the remainder of the
// restored data lost. unmap[k] gives the corresponding code for character k in
// the range 0..255-cut if k is in the allowed set, or 255 if k is not in the
// allowed set. Characters in the input that are not in the allowed set are
// ignored. cut must be in the range 1..128. If cut is out of range, zero is
// returned and no restoration is conducted.
size_t restore(unsigned char *output, size_t size,
               unsigned char const *input, size_t len,
               unsigned char const *unmap, unsigned cut) {
    if (cut < 1 || cut > 128)
        return 0;
    unsigned buf = 0;
    int bits = 0;
    size_t in = 0, out = 0;
    while (in < len) {
        unsigned code = unmap[input[in++]];
        if (code == 255)
            continue;
        if (code < cut) {
            buf <<= 7;
            bits += 7;
        }
        else {
            buf <<= 8;
            bits += 8;
            buf += cut;
        }
        buf += code;
        if (bits >= 8) {
            if (out < size)
                output[out] = buf >> (bits - 8);
            out++;
            bits -= 8;
        }
    }
    return out;
}
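
Since the question is about Python, here is a rough Python sketch of the same 7-or-8-bit scheme. It is an illustrative port, not part of avoid.c: the helper allowed_table and the default forbidden=(2, 3) are assumptions made for this example, and instead of taking a caller-supplied map[] it simply uses the allowed byte values in ascending order.

```python
import zlib


def allowed_table(forbidden):
    """Permitted byte values in ascending order (plays the role of map[])."""
    return bytes(b for b in range(256) if b not in forbidden)


def avoid(data, forbidden=(2, 3)):
    """Encode data so that none of the forbidden byte values appear."""
    table = allowed_table(forbidden)
    cut = 256 - len(table)
    if not 1 <= cut <= 128:
        raise ValueError("between 1 and 128 byte values must be forbidden")
    out = bytearray()
    buf = bits = 0  # bit accumulator and number of unconsumed bits in it
    i, n = 0, len(data)
    while i < n or bits > 0:
        if bits < 8 and i < n:  # refill so that 8 bits can be peeked
            buf = (buf << 8) | data[i]
            i += 1
            bits += 8
        if bits >= 8:
            code = (buf >> (bits - 8)) & 0xFF
        else:  # end of input: pad with zero bits
            code = (buf << (8 - bits)) & 0xFF
        if code >> 1 < cut:  # top 7 bits are small: consume only 7 bits
            out.append(table[code >> 1])
            bits -= 7
        else:  # otherwise consume all 8 bits
            out.append(table[code - cut])
            bits -= 8
        bits = max(bits, 0)
        buf &= (1 << bits) - 1  # forget the bits that were consumed
    return bytes(out)


def restore(enc, forbidden=(2, 3)):
    """Invert avoid(); bytes outside the allowed set are ignored."""
    table = allowed_table(forbidden)
    cut = 256 - len(table)
    if not 1 <= cut <= 128:
        raise ValueError("between 1 and 128 byte values must be forbidden")
    unmap = {b: k for k, b in enumerate(table)}
    out = bytearray()
    buf = bits = 0
    for b in enc:
        code = unmap.get(b)
        if code is None:
            continue
        if code < cut:  # 7-bit symbol
            buf = (buf << 7) | code
            bits += 7
        else:  # 8-bit symbol
            buf = (buf << 8) | (code + cut)
            bits += 8
        if bits >= 8:
            out.append((buf >> (bits - 8)) & 0xFF)
            bits -= 8
            buf &= (1 << bits) - 1
    return bytes(out)  # leftover bits (< 8) are the zero padding


# Usage with the framing from the question (the sample text is made up).
DATA = "some payload text " * 40
cmd = b"\x02" + avoid(zlib.compress(DATA.encode())) + b"\x03"
received = zlib.decompress(restore(cmd[1:-1])).decode()
```

With only two forbidden values (cut = 2), the expansion formula in the C comments gives an average overhead of about 0.2% for random input, compared with 25% for Ascii85 and 33% for base64.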