How can I save a String Byte without losing information?

Question

I am developing a JPEG decoder (I am at the Huffman stage) and I want to write a binary string to a file. For example, suppose we have this:

String huff = "00010010100010101000100100";

Since I can't write individual bits, I tried splitting it into groups of 8, converting each group to an integer, and saving that integer:

for (String str : huff.split("(?<=\\G.{8})")) { // split into 8-character chunks
    int val = Integer.parseInt(str, 2);
    out.write(val); // writes to a FileOutputStream
}

The problem is that, in my example, if I try to save "00010010", it gets converted to 18 (10010), and I need the leading 0s.

Finally, when I read it back:

int enter;
String code = "";
while ((enter = in.read()) != -1) {
    code += Integer.toBinaryString(enter);
}

I get:

Code = 10010

instead of:

Code = 00010010

I also tried converting it to a BitSet and then to a Byte[], but I ran into the same problem.
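The symptom can be reproduced without a file. The following is a minimal sketch, assuming an in-memory ByteArrayOutputStream in place of the FileOutputStream; the class name LeadingZeroDemo and the helper readBackAsBinary are hypothetical names used only for illustration. It suggests that the byte itself comes back unchanged and that the zeros disappear only when the value is turned back into text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

class LeadingZeroDemo {

    // Hypothetical helper: writes one 8-bit chunk as a byte, reads it back,
    // and returns what Integer.toBinaryString reports for it.
    static String readBackAsBinary(String eightBits) {
        int val = Integer.parseInt(eightBits, 2);

        // Assumption: an in-memory stream stands in for the file streams.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(val);
        int readBack = new ByteArrayInputStream(out.toByteArray()).read();

        // The numeric value survives the round trip unchanged.
        if (readBack != val) {
            throw new IllegalStateException("byte changed during round trip");
        }

        // toBinaryString never emits leading zeros.
        return Integer.toBinaryString(readBack);
    }

    public static void main(String[] args) {
        System.out.println(Integer.parseInt("00010010", 2)); // prints "18"
        System.out.println(readBackAsBinary("00010010"));    // prints "10010"
    }
}
```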

Answer 1

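You might want to take a look at the UTF-8 algorithm, since it does exactly what you want. It stores massive amounts of data while discarding zeros, keeping the relevant data, and encoding it to take up less disk space.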

Works with: Java version 7+

import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");

        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}

https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java

UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.

It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.

https://en.wikipedia.org/wiki/UTF-8

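The binary 11110000 10010000 10001101 10001000 becomes F0 90 8D 88 in UTF-8. Since you are storing it as text, you go from having to store 32 characters to storing 8. And because it is a well-known and well-designed encoding, you can reverse it easily. All the math is done for you.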
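To make the relationship between the binary groups and the hex digits concrete, here is a small sketch. The class name BinaryGroupsToHex and its two helpers are hypothetical names chosen for illustration. It parses each 8-bit group into a byte, prints the bytes as hex, and then decodes the same four bytes as UTF-8 to show which single code point they represent.

```java
import java.nio.charset.StandardCharsets;

class BinaryGroupsToHex {

    // Parses space-separated 8-bit groups ("11110000 10010000 ...") into raw bytes.
    static byte[] parseGroups(String groups) {
        String[] parts = groups.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i], 2);
        }
        return bytes;
    }

    // Formats bytes as two-digit uppercase hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = parseGroups("11110000 10010000 10001101 10001000");
        System.out.println(toHex(bytes)); // prints "F0 90 8D 88"

        // Interpreted as UTF-8, the four bytes form one code point.
        int codePoint = new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
        System.out.printf("U+%04X%n", codePoint); // prints "U+10348"
    }
}
```

Note that the hex form is only a different way of printing the same four bytes; the sketch does not change how many bytes are stored.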

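Your example of 00010010100010101000100100 (or rather 00000001 0010100 0101010 00100100) converts to *$ (two unprintable characters on my machine). That is the UTF-8 encoding of the binary. I had mistakenly used a different site that treated the data I entered as decimal instead of binary.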

https://onlineutf8tools.com/convert-binary-to-utf8

A really good explanation of UTF-8 and how it applies to the answer:

https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/

Edit:

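I took this question as asking for a way to reduce the number of characters needed to store values, which is a type of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0s and 1s in a much shorter format. That is how this answer is relevant.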

If you concatenate the characters, you can easily go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits), and encode a value as large as 9,223,372,036,854,775,807.

Answer 2

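In your example, you have the string "10010" and you want the string "00010010". That is, you need to left-pad this string with zeros. Note that since you are joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad each of these strings inside the loop, before concatenating them.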

while((enter = in.read()) != -1) {
    String binary = Integer.toBinaryString(enter);
    // left-pad to length 8
    binary = ("00000000" + binary).substring(binary.length());
    code += binary;
}
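Putting the write side and the padded read side together, a full round trip might look like the sketch below. Two things in it are assumptions that do not come from the question or this answer. First, it uses in-memory streams instead of file streams. Second, it writes a one-byte header recording how many padding bits were added, because the 26-bit example is not a multiple of 8 and the final chunk "00" would otherwise come back as "00000000". The names BitStringRoundTrip, pack, and unpack are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

class BitStringRoundTrip {

    // Packs a string of '0'/'1' characters into bytes, 8 bits per byte.
    // Assumed format: the first byte holds the number of zero bits (0..7)
    // appended to fill the last data byte.
    static byte[] pack(String bits) {
        int pad = (8 - bits.length() % 8) % 8;
        StringBuilder padded = new StringBuilder(bits);
        for (int i = 0; i < pad; i++) {
            padded.append('0');
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(pad);
        for (int i = 0; i < padded.length(); i += 8) {
            out.write(Integer.parseInt(padded.substring(i, i + 8), 2));
        }
        return out.toByteArray();
    }

    // Reverses pack: left-pads every byte to 8 digits, then drops the padding bits.
    static String unpack(byte[] data) {
        if (data.length == 0) {
            return "";
        }
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int pad = in.read();
        StringBuilder code = new StringBuilder();
        int enter;
        while ((enter = in.read()) != -1) {
            String binary = Integer.toBinaryString(enter);
            // left-pad to length 8
            code.append("00000000".substring(binary.length())).append(binary);
        }
        code.setLength(code.length() - pad);
        return code.toString();
    }

    public static void main(String[] args) {
        String huff = "00010010100010101000100100";
        byte[] stored = pack(huff);
        System.out.println(stored.length);              // prints "5"
        System.out.println(unpack(stored).equals(huff)); // prints "true"
    }
}
```

With real files, the same logic should carry over by swapping the in-memory streams for a FileOutputStream and a FileInputStream. A StringBuilder is used here instead of repeated += only to avoid rebuilding the string on every byte.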

Tags: java, string, int, byte, bitset