Why do the String, StringBuffer and StringBuilder classes use a byte array instead of a character array to store the characters of a string?

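A single byte cannot hold the Unicode code points of characters from the world's various languages, so with a byte array we could not have strings in different languages. Why do these classes use a byte array instead of a character array?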

Update:

class First
{
        public static void main(String[] args)
        {
                System.out.println();
                // The same Devanagari text, once as Unicode escapes and once as a literal
                String s = "\u0935\u0902\u0926\u0947 \u092e\u093e\u0924\u0930\u092e\u094d";
                String s1 = "वंदे मातरम्";
                System.out.println(s);
                System.out.println(s1);
        }
}

I think each character in the strings above takes two bytes. How can they be accommodated in a single byte?

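As an optimization, some VM implementations (such as OpenJDK 9 and up) store strings that consist solely of ASCII-encoded characters (more precisely, Latin-1 characters, of which ASCII is a subset) in a byte array, as opposed to using a char[].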

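And since String is very often used for technical content (as opposed to natural language), most String values in most programs fit that description, even if the code handles languages that are not written in ASCII, such as Arabic or Japanese. HTML tags, logger IDs, debug output and similar things can nearly always use these compact strings.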

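Since there is no (officially supported) way to access the raw data, and all access has to go through the String methods, this usually does not cause any compatibility problems.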

A byte cannot accommodate unicodes of characters from various languages of the world. So using a byte array we cannot have a string of different languages.

A char cannot either, since it is only 16 bits wide. You would need an int. But one int per character feels too wasteful.
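To illustrate the point about 16 bits, here is a minimal sketch using only standard String and Character methods. The class name CodePointDemo and its helper methods are made up for this example. It shows that a character outside the Basic Multilingual Plane needs two char units (a surrogate pair) even though it is one code point.

```java
class CodePointDemo {
    // Number of 16-bit char units the string occupies.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points the string contains.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+0935 DEVANAGARI LETTER VA fits in a single char.
        String bmp = "\u0935";
        // U+1F600 lies outside the BMP, so it is stored as a surrogate pair.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(charUnits(bmp) + " " + codePoints(bmp));                     // prints "1 1"
        System.out.println(charUnits(supplementary) + " " + codePoints(supplementary)); // prints "2 1"
    }
}
```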

Why these classes use byte array instead of character array ?

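In the past, few strings held words taken from spoken languages. Almost all of them were computer code that used only ASCII characters, which can be encoded in 7 bits. Using 16 bits or more per character would feel like a big waste of memory. So they encode strings in bytes: one byte per character if every character fits (ASCII, or more precisely Latin-1), and UTF-16 if some character does not. That saves memory when it can and stays good enough when it cannot.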
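The size difference can be seen from the outside by encoding the same text with two standard charsets. This is only a rough illustration through the public API, not a look at the JVM's internal storage. The class name EncodedSizeDemo and its helper methods are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Bytes needed when every character is written as one Latin-1 byte.
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Bytes needed when every char is written as two UTF-16 bytes (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String tag = "<html>";
        System.out.println(latin1Bytes(tag)); // prints 6
        System.out.println(utf16Bytes(tag));  // prints 12
    }
}
```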

The use of byte[] is an optimization introduced in Java 9. The goals and motivation for this change are described in JEP 254: Compact Strings.

Summary

Adopt a more space-efficient internal representation for strings.

Goals

Improve the space efficiency of the String class and related classes while maintaining performance in most scenarios and preserving full compatibility for all related Java and native interfaces.

Non-Goals

It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings. A subsequent JEP may explore that approach.

Motivation

The current implementation of the String class stores characters in a char array, using two bytes (sixteen bits) for each character. Data gathered from many different applications indicates that strings are a major component of heap usage and, moreover, that most String objects contain only Latin-1 characters. Such characters require only one byte of storage, hence half of the space in the internal char arrays of such String objects is going unused.

Description

We propose to change the internal representation of the String class from a UTF-16 char array to a byte array plus an encoding-flag field. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer will be updated to use the same representation, as will the HotSpot VM's intrinsic string operations.

This is purely an implementation change, with no changes to existing public interfaces. There are no plans to add any new public APIs or other interfaces.

The prototyping work done to date confirms the expected reduction in memory footprint, substantial reductions of GC activity, and minor performance regressions in some corner cases.

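In fact, the String class in Java 9 and later versions stores its characters in a byte array using either 1 byte or 2 bytes per character, depending on the content of the string. In String.java there is a field

private final byte coder;

that determines the encoding (LATIN1 or UTF16) used for the characters in the string.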
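How a byte array plus a coder flag can still represent any char sequence can be sketched as follows. This is a simplified illustration, not the actual JDK source: the class CompactText, its methods, and the big-endian byte order chosen for the two-byte case are assumptions made for this example.

```java
// Hypothetical sketch of a compact string: a byte[] plus an encoding flag.
class CompactText {
    static final byte LATIN1 = 0; // one byte per character
    static final byte UTF16 = 1;  // two bytes per character

    private final byte[] value;
    private final byte coder;

    CompactText(String s) {
        // The one-byte form is only possible if every char is at most 0xFF.
        boolean fitsLatin1 = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                fitsLatin1 = false;
                break;
            }
        }
        if (fitsLatin1) {
            coder = LATIN1;
            value = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                value[i] = (byte) s.charAt(i);
            }
        } else {
            coder = UTF16;
            value = new byte[s.length() * 2];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8); // high byte first (assumed order)
                value[2 * i + 1] = (byte) c;    // low byte
            }
        }
    }

    byte coder() {
        return coder;
    }

    int storedBytes() {
        return value.length;
    }

    int length() {
        return coder == LATIN1 ? value.length : value.length / 2;
    }

    // Callers always get a char back, whichever encoding is used internally.
    char charAt(int i) {
        if (coder == LATIN1) {
            return (char) (value[i] & 0xFF);
        }
        return (char) (((value[2 * i] & 0xFF) << 8) | (value[2 * i + 1] & 0xFF));
    }

    public static void main(String[] args) {
        CompactText ascii = new CompactText("<html>");
        CompactText hindi = new CompactText("\u0935\u0902\u0926\u0947");
        System.out.println(ascii.coder() + " " + ascii.storedBytes());
        System.out.println(hindi.coder() + " " + hindi.storedBytes());
    }
}
```

Because charAt and length hide the coder, code using such a class sees the same characters either way, which is why the change could be made without touching the public String API.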