Difference between compact strings and compressed strings in Java 9

Question

What are the advantages of compact strings over compressed strings in JDK 9?

Answer 1

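-XX:+UseCompressedStrings and Compact Strings are different things.

UseCompressedStrings meant that strings containing only ASCII characters could be converted to byte[], but it was turned off by default. In JDK 9 a similar optimization is on by default, not through that flag but built into the String class itself.

Until Java 9, strings were stored internally as a char[] in UTF-16 encoding. From Java 9 onwards they are stored as a byte[]. Why?

Because in ISO_LATIN_1 each character can be encoded in a single byte (8 bits) instead of the two bytes used until then (16 bits, of which the upper 8 were never used). This works only for ISO_LATIN_1, but that covers the majority of strings in use.

So this is done to save space.

Here is a small example that should make things clearer: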

class StringCharVsByte {
    public static void main(String[] args) {
        String first = "first";
        String russianFirst = "первыи";

        char[] c1 = first.toCharArray();
        char[] c2 = russianFirst.toCharArray();

        for (char c : c1) {
            System.out.println(c >>> 8);
        }

        for (char c : c2) {
            System.out.println(c >>> 8);
        }
    }
}

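In the first case we get only zeros, meaning the most significant 8 bits of every char are zero. In the second case we get non-zero values, meaning at least one of the most significant 8 bits is set.

This means that if strings are stored internally as an array of chars, there are string literals that waste half of each char. It turns out that many applications waste a lot of space because of this.

You have a String made of 10 Latin1 characters? You just lost 80 bits, or 10 bytes. String compaction was introduced to mitigate this, and now there is no such space loss for these strings.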
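The arithmetic above can be checked with a small sketch. The class and method names here are made up for illustration, and only the character payload is counted (object and array headers are ignored), so treat it as a rough estimate rather than a real memory measurement.

```java
class PayloadSizeSketch {
    // True if every char fits in one byte (value <= 0xFF), i.e. is Latin-1.
    static boolean isLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Payload bytes with a char[] layout: always two bytes per char.
    static int charArrayBytes(String s) {
        return 2 * s.length();
    }

    // Payload bytes with a compact layout: one byte per char when all chars are Latin-1.
    static int compactBytes(String s) {
        return isLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String latin = "0123456789";                             // 10 Latin-1 characters
        String cyrillic = "\u043F\u0435\u0440\u0432\u044B\u0439"; // 6 Cyrillic characters, written as escapes
        System.out.println(charArrayBytes(latin) - compactBytes(latin));       // prints 10
        System.out.println(charArrayBytes(cyrillic) - compactBytes(cyrillic)); // prints 0
    }
}
```

The 10 bytes saved for the Latin-1 string match the "80 bits" figure; the Cyrillic string saves nothing because it still needs two bytes per character.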

Internally this also means some very nice things. To distinguish between strings that are LATIN1 and strings that are UTF-16, there is a field coder:

/**
 * The identifier of the encoding used to encode the bytes in
 * {@code value}. The supported values in this implementation are
 *
 * LATIN1
 * UTF16
 *
 * @implNote This field is trusted by the VM, and is a subject to
 * constant folding if String instance is constant. Overwriting this
 * field after construction will cause problems.
 */
private final byte coder;

Based on this, length is now computed differently:

public int length() {
    return value.length >> coder();
}

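If our string is Latin1 only, coder is zero, so the length of value (the byte array) is the number of chars. For non-Latin1 strings coder is one, so the shift divides the array length by two.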
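As a standalone illustration of that shift, here is a minimal sketch. LengthSketch is a hypothetical class; the constants are assumed to mirror the JDK's values (LATIN1 = 0, UTF16 = 1).

```java
class LengthSketch {
    // Assumed to mirror the JDK constants.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Number of chars represented by the byte array under the given coder:
    // shifting right by 0 leaves the length unchanged, shifting by 1 halves it.
    static int length(byte[] value, byte coder) {
        return value.length >> coder;
    }

    public static void main(String[] args) {
        byte[] latin = new byte[5];   // e.g. "first" stored as 5 Latin-1 bytes
        byte[] utf16 = new byte[12];  // e.g. 6 chars stored as 12 UTF-16 bytes
        System.out.println(length(latin, LATIN1)); // prints 5
        System.out.println(length(utf16, UTF16));  // prints 6
    }
}
```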

Answer 2

Compact strings get the best of both worlds.

As can be seen from the definition given in the OpenJDK documentation:

The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

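As @Eugene mentioned, most strings can be encoded in Latin-1 and need only one byte per character, so they do not require the full 2 bytes per character that the previous String class implementation provides.

The new String class implementation moves from a UTF-16 char array to a byte array plus an encoding-flag field. The additional encoding field indicates whether the characters are stored in UTF-16 or in Latin-1 format.

It also follows that strings can still be stored in UTF-16 format when required. This is the main difference between the Compressed Strings of Java 6 and the Compact Strings of Java 9: with Compressed Strings, a byte[] array was used for storage only when the content could be represented as pure ASCII.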
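The idea of "a byte array plus an encoding flag" can be sketched as follows. This is an illustrative toy, not the JDK's code: CompactEncodeSketch and its helper methods are hypothetical names, and storing the high byte first in the UTF-16 case is an arbitrary choice made for the sketch.

```java
class CompactEncodeSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte[] value; // character data, one or two bytes per char
    final byte coder;   // which of the two encodings value uses

    CompactEncodeSketch(char[] chars) {
        byte[] compressed = tryCompress(chars);
        if (compressed != null) {
            value = compressed;
            coder = LATIN1;
        } else {
            value = toUtf16Bytes(chars);
            coder = UTF16;
        }
    }

    // Returns one byte per char, or null if any char is outside Latin-1.
    static byte[] tryCompress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] > 0xFF) {
                return null;
            }
            out[i] = (byte) chars[i];
        }
        return out;
    }

    // Two bytes per char, high byte first (assumption of this sketch).
    static byte[] toUtf16Bytes(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >>> 8);
            out[2 * i + 1] = (byte) chars[i];
        }
        return out;
    }

    public static void main(String[] args) {
        CompactEncodeSketch latin = new CompactEncodeSketch("first".toCharArray());
        CompactEncodeSketch cyrillic =
                new CompactEncodeSketch("\u043F\u0435\u0440\u0432\u044B\u0439".toCharArray());
        System.out.println(latin.coder + " " + latin.value.length);       // prints 0 5
        System.out.println(cyrillic.coder + " " + cyrillic.value.length); // prints 1 12
    }
}
```

The point of the design is that the fallback path still exists: a string that cannot be compressed is simply stored with two bytes per character and flagged accordingly.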

Answer 3

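Compressed strings (Java 6) and compact strings (Java 9) both have the same motivation (strings are often effectively Latin-1, so half the space is wasted) and the same goal (make those strings small), but the implementations differ a lot.

Compressed Strings

In an interview, Aleksey Shipilëv (who was in charge of implementing the Java 9 feature) had this to say about compressed strings: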

UseCompressedStrings feature was rather conservative: while distinguishing between char[] and byte[] case, and trying to compress the char[] into byte[] on String construction, it done most String operations on char[], which required to unpack the String. Therefore, it benefited only a special type of workloads, where most strings are compressible (so compression does not go to waste), and only a limited amount of known String operations are performed on them (so no unpacking is needed). In great many workloads, enabling -XX:+UseCompressedStrings was a pessimization.

[...] UseCompressedStrings implementation was basically an optional feature that maintained a completely distinct String implementation in alt-rt.jar, which was loaded once the VM option is supplied. Optional features are harder to test, since they double the number of option combinations to try.

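Compact Strings

In Java 9, on the other hand, compact strings are fully integrated into the JDK source. String is always backed by a byte[], in which characters use one byte if they are Latin-1 and two bytes otherwise. Most operations check which case applies, e.g. charAt: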

public char charAt(int index) {
    if (isLatin1()) {
        return StringLatin1.charAt(value, index);
    } else {
        return StringUTF16.charAt(value, index);
    }
}

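Compact strings are enabled by default and can be partially disabled with -XX:-CompactStrings. "Partially" because strings are still backed by a byte[], so operations returning chars must still assemble each char from two separate bytes (due to intrinsics it is hard to say whether this has a performance impact).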
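To make "assembling a char from two separate bytes" concrete, here is a hedged sketch of a charAt over a byte[]. CharAtSketch is a hypothetical class, coder values 0 and 1 are assumed to mean Latin-1 and UTF-16, and the high-byte-first order is an assumption of this sketch, not a statement about the JDK's internal layout.

```java
class CharAtSketch {
    // Reads the char at the given index from a byte[] in either encoding.
    static char charAt(byte[] value, byte coder, int index) {
        if (coder == 0) {
            // Latin-1: one byte per char; mask to avoid sign extension.
            return (char) (value[index] & 0xFF);
        }
        // UTF-16: two bytes per char, combined into one 16-bit value.
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        byte[] latin = {(byte) 0xE9};             // U+00E9 as one Latin-1 byte
        byte[] utf16 = {(byte) 0x04, (byte) 0x3F}; // U+043F as two bytes, high byte first
        System.out.println((int) charAt(latin, (byte) 0, 0)); // prints 233
        System.out.println((int) charAt(utf16, (byte) 1, 0)); // prints 1087
    }
}
```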
