How can I add a header to a compressed string with gzip in Python?

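I'm trying to compress a string in Python the same way a particular piece of C# code does, but I get a different result. It seems I have to add a header to the compressed result, but I don't know how to add a header to a compressed string in Python. This is the C# line I don't know how to reproduce in Python:
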
memoryStream.Read(compressedBytes, CompressedMessageHeaderLength, (int)memoryStream.Length);

Here is the complete, runnable C# code:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

namespace Rextester
{
    /// <summary>Handles compressing and decompressing API requests and responses.</summary>
    public class Compression
    {
        #region Member Variables
        /// <summary>The compressed message header length.</summary>
        private const int CompressedMessageHeaderLength = 4;
        #endregion

        #region Methods
        /// <summary>Compresses the XML string.</summary>
        /// <param name="documentToCompress">The XML string to compress.</param>
        public static string CompressData(string data)
        {
            using (MemoryStream memoryStream = new MemoryStream())
            {
                byte[] plainBytes = Encoding.UTF8.GetBytes(data);

                using (GZipStream zipStream = new GZipStream(memoryStream, CompressionMode.Compress, leaveOpen: true))
                {
                    zipStream.Write(plainBytes, 0, plainBytes.Length);
                }

                memoryStream.Position = 0;

                byte[] compressedBytes = new byte[memoryStream.Length + CompressedMessageHeaderLength];

                Buffer.BlockCopy(
                    BitConverter.GetBytes(plainBytes.Length),
                    0,
                    compressedBytes,
                    0,
                    CompressedMessageHeaderLength
                );

                // Copy the gzip data in after the header, which holds the length of the uncompressed message.
                memoryStream.Read(compressedBytes, CompressedMessageHeaderLength, (int)memoryStream.Length);

                string compressedXml = Convert.ToBase64String(compressedBytes);

                return compressedXml;
            }
        }
        
 
        #endregion
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            //Your code goes here
            string data = "Hello World!";
            Console.WriteLine(  Compression.CompressData(data) );
            // result would be DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA

        }
    }
}

Here is the Python code I wrote:

data = 'Hello World!'

import gzip
import base64
print(base64.b64encode(gzip.compress(data.encode('utf-8'))))

# I expect DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA 
# but I get H4sIACwuuWAC//NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=

You can use to_bytes to convert the length of the encoded string:

import sys  # gzip, base64 and data come from the question's snippet
enc = data.encode('utf-8')
zipped = gzip.compress(enc)
print(base64.b64encode(len(enc).to_bytes(4, sys.byteorder) + zipped))  # sys.byteorder can be replaced by a fixed value such as 'little'

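Also, gzip.compress(enc) seems to produce a slightly different result than its C# counterpart (so the overall result will differ too), but this should not be a problem, because decompression should handle everything correctly.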
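To check that claim, here is a minimal sketch of the reverse direction. The function name decompress_data and the constant HEADER_LENGTH are made up for illustration, and it assumes the 4-byte length header is little-endian. It decodes both the string printed by the C# program in the question and a string built in Python as suggested above.

```python
import base64
import gzip

HEADER_LENGTH = 4  # mirrors CompressedMessageHeaderLength in the C# code


def decompress_data(encoded: str, byteorder: str = "little") -> str:
    """Decode base64, split off the length header, and gunzip the rest.

    Hypothetical helper; the little-endian default is an assumption.
    """
    raw = base64.b64decode(encoded)
    expected_length = int.from_bytes(raw[:HEADER_LENGTH], byteorder)
    plain = gzip.decompress(raw[HEADER_LENGTH:])
    if len(plain) != expected_length:
        raise ValueError("length header does not match decompressed size")
    return plain.decode("utf-8")


# The string the C# program in the question prints.
from_csharp = "DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA"

# A string built in Python: 4-byte length header followed by the gzip data.
enc = "Hello World!".encode("utf-8")
from_python = base64.b64encode(
    len(enc).to_bytes(HEADER_LENGTH, "little") + gzip.compress(enc)
).decode("ascii")

print(decompress_data(from_csharp))
print(decompress_data(from_python))
```

Both calls should give back the original text even though the two base64 strings are not identical, which is the point being made here.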

As others have mentioned, the difference is that the C# version puts a header in front of the compressed data.

Also, note that gzip compression can be done in several ways. For example, in C# you can specify a CompressionLevel of Optimal, Fastest, or NoCompression. See: https://docs.microsoft.com/en-us/dotnet/api/system.io.compression.compressionlevel?view=net-5.0

I'm not familiar enough with Python to say how it handles gzip compression by default (perhaps Fastest in C# uses a more or less aggressive algorithm than Python does).
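For the Python side, here is a rough sketch of the equivalent knobs, assuming Python 3.8 or later (where gzip.compress accepts the mtime keyword). According to the gzip module documentation, compresslevel defaults to 9, and mtime sets the timestamp stored in the gzip header. The variable names are only for illustration.

```python
import gzip

payload = "Hello World!".encode("utf-8")

# compresslevel runs from 0 (no compression) to 9 (the default).
for level in (0, 1, 9):
    blob = gzip.compress(payload, compresslevel=level, mtime=0)
    # Whatever the level, decompression returns the same bytes.
    assert gzip.decompress(blob) == payload
    print(level, len(blob), blob.hex())

# With a fixed mtime the output no longer changes from run to run.
stable_a = gzip.compress(payload, mtime=0)
stable_b = gzip.compress(payload, mtime=0)
print(stable_a == stable_b)
```

Without mtime=0 the header carries the current time, which is one reason the Python string in the question will not be reproduced exactly on another run.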

Here is your C# code with the header length set to 0 and the output for the three CompressionLevel values. Note that it outputs string values 'pretty close' to what you get in Python.

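You should also ask whether it really matters that the values are different. As long as you can encode and decode, isn't that enough?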

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public class Program
{
    public static void Main()
    {
        string data = "Hello World!";
        Console.WriteLine(  Compression.CompressData(data, CompressionLevel.Fastest) );
        Console.WriteLine(  Compression.CompressData(data, CompressionLevel.NoCompression) );
        Console.WriteLine(  Compression.CompressData(data, CompressionLevel.Optimal) );
        // result would be DAAAAB+LCAAAAAAABADzSM3JyVcIzy/KSVEEAKMcKRwMAAAA
        // but I get       H4sIACwuuWAC//NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=
    }
}

public class Compression
    {
        #region Member Variables
        /// <summary>The compressed message header length.</summary>
        private const int CompressedMessageHeaderLength = 0; // changed to zero
        #endregion

        #region Methods
        /// <summary>Compresses the XML string.</summary>
        /// <param name="documentToCompress">The XML string to compress.</param>
        public static string CompressData(string data, CompressionLevel compressionLevel)
        {
            using (MemoryStream memoryStream = new MemoryStream())
            {
                byte[] plainBytes = Encoding.UTF8.GetBytes(data);

                using (GZipStream zipStream = new GZipStream(memoryStream, compressionLevel, leaveOpen: true))
                {
                    zipStream.Write(plainBytes, 0, plainBytes.Length);
                }

                memoryStream.Position = 0;

                byte[] compressedBytes = new byte[memoryStream.Length + CompressedMessageHeaderLength];

                Buffer.BlockCopy(
                    BitConverter.GetBytes(plainBytes.Length),
                    0,
                    compressedBytes,
                    0,
                    CompressedMessageHeaderLength
                );

                // Copy the gzip data in after the header, which holds the length of the uncompressed message.
                memoryStream.Read(compressedBytes, CompressedMessageHeaderLength, (int)memoryStream.Length);

                string compressedXml = Convert.ToBase64String(compressedBytes);

                return compressedXml;
            }
        }
        
 
        #endregion
    }

Output:

H4sIAAAAAAAEA/NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=
H4sIAAAAAAAEAwEMAPP/SGVsbG8gV29ybGQhoxwpHAwAAAA=
H4sIAAAAAAAAA/NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=
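To see where these strings differ from the Python one, here is a small sketch that decodes them and prints the header fields defined by the gzip format (modification time, extra flags, operating system), then compares the bytes after the fixed 10-byte header. It assumes none of the optional header fields are present, which holds for these strings since their flags byte is 0. The dictionary labels are my own.

```python
import base64
import gzip

outputs = {
    "C# Fastest": "H4sIAAAAAAAEA/NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=",
    "C# NoCompression": "H4sIAAAAAAAEAwEMAPP/SGVsbG8gV29ybGQhoxwpHAwAAAA=",
    "C# Optimal": "H4sIAAAAAAAAA/NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=",
    "Python (question)": "H4sIACwuuWAC//NIzcnJVwjPL8pJUQQAoxwpHAwAAAA=",
}

GZIP_HEADER_LENGTH = 10  # fixed header size when no optional fields are set

bodies = {}
for name, text in outputs.items():
    raw = base64.b64decode(text)
    mtime = int.from_bytes(raw[4:8], "little")  # MTIME field
    xfl, os_id = raw[8], raw[9]                 # XFL and OS fields
    bodies[name] = raw[GZIP_HEADER_LENGTH:]     # deflate data + CRC32 + size
    print(f"{name}: mtime={mtime} xfl={xfl} os={os_id} -> {gzip.decompress(raw)!r}")

# True when only the header bytes differ between the C# and Python outputs.
print(bodies["C# Fastest"] == bodies["C# Optimal"] == bodies["Python (question)"])
```

For Fastest, Optimal, and the Python string, everything after the header is byte-for-byte the same; only the timestamp, extra-flags, and OS bytes differ. NoCompression stores the text uncompressed, so its body is different, but it still decompresses to the same string.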

Also at: https://dotnetfiddle.net/TI8gwM

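One thing I'd start with is that the C# code is not well-suited for cross-platform use. The byte order of the length header depends on the underlying architecture, because BitConverter.GetBytes returns the bytes in the architecture's native order.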

However, with C# we are probably talking about Windows, which probably means Intel, so little-endian is very likely.
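A quick way to see what the byte order does to the header, as a small sketch using the 12-byte example string from the question:

```python
import base64
import sys

length = len("Hello World!".encode("utf-8"))  # 12 bytes

little = length.to_bytes(4, "little")
big = length.to_bytes(4, "big")

print(little.hex(), base64.b64encode(little).decode("ascii"))
print(big.hex(), base64.b64encode(big).decode("ascii"))
print("native byte order on this machine:", sys.byteorder)
```

The little-endian header encodes to DAAAAA== on its own, which matches the DAAAA prefix of the expected C# output in the question, while the big-endian one would start with AAAAD.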

So what you need to do is prepend the length of the original data, in little-endian order, to the compressed data. Exactly 4 bytes.

bdata = data.encode('utf-8')
compressed = gzip.compress(bdata)
header = len(bdata).to_bytes(4,'little')

Then you need to concatenate and convert to base64:

print(base64.b64encode(header + compressed))