Why doesn't reading a character that has no ASCII representation with System.in give the character in two bytes?

Question

import java.io.IOException;

public class Main {

    public static void main(String[] args) throws IOException {

        char ch = '诶';

        System.out.println((int)ch);

        int c;
        while ((c = System.in.read()) != -1)
        {
            System.out.println(c);
        }
    }
}

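Output:

This shows that the Unicode value of the character 诶 is 35830. In binary, that is 10001011 11110110.

When I type this character into the terminal, I expect to get two bytes, 10001011 and 11110110, which I could combine again to recover the original character.

But what I actually get is:

I can see that 10 represents the newline character. But what do the first 3 numbers mean?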

Answer 1

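UTF-8 is a multi-byte, variable-length encoding.

So that whatever is reading the byte stream knows there are more bytes to read to complete the current code point, some values cannot appear at certain positions in a valid UTF-8 byte stream. Basically, certain bit patterns mean "hang on, I'm not done".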
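To make the variable length concrete, here is a minimal sketch that counts the UTF-8 bytes of a few sample characters and checks for the "not done yet" pattern 10xxxxxx. The class and method names (Utf8LengthSketch, utf8Length, isContinuation) are made up for illustration, and the characters other than 诶 are examples that do not appear in the question.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthSketch {

    // Number of bytes the given string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the byte has the form 10xxxxxx, i.e. it continues a code point
    // that was started by an earlier byte.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, prints 1
        System.out.println(utf8Length("\u00E9"));       // U+00E9, prints 2
        System.out.println(utf8Length("\u8BF6"));       // U+8BF6 (the character from the question), prints 3
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600 (a surrogate pair in Java), prints 4

        System.out.println(isContinuation((byte) 232)); // lead byte 11101000, prints false
        System.out.println(isContinuation((byte) 175)); // continuation byte 10101111, prints true
    }
}
```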

There is a table explaining it here. Code points in the range U+0800 to U+FFFF need up to 16 bits to represent; their byte representation consists of 3 bytes:

1st byte    2nd byte    3rd byte
1110xxxx    10xxxxxx    10xxxxxx

What you are seeing is 232 175 182 because those are the UTF-8 encoded bytes of the character.
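The pattern can also be applied by hand. The sketch below spreads the 16 bits of 35830 (0x8BF6, binary 1000 101111 110110) over the x positions of the three-byte pattern and then reverses the process. The class and method names are hypothetical, and the code only handles the three-byte range; it does not reject surrogate values.

```java
public class Utf8ThreeByteSketch {

    // Encodes a code point in the range U+0800..U+FFFF into its three UTF-8 bytes,
    // returned as unsigned values (0..255).
    static int[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("outside the 3-byte range");
        }
        int first  = 0b11100000 | (codePoint >> 12);         // 1110xxxx: top 4 bits
        int second = 0b10000000 | ((codePoint >> 6) & 0x3F); // 10xxxxxx: middle 6 bits
        int third  = 0b10000000 | (codePoint & 0x3F);        // 10xxxxxx: low 6 bits
        return new int[] { first, second, third };
    }

    // Reverses the encoding: strips the marker bits and reassembles the 16 payload bits.
    static int decodeThreeBytes(int first, int second, int third) {
        return ((first & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F);
    }

    public static void main(String[] args) {
        int[] bytes = encodeThreeBytes(0x8BF6); // 0x8BF6 == 35830, the character from the question
        for (int b : bytes) {
            System.out.println(b + " " + Integer.toBinaryString(b));
        }
        System.out.println(decodeThreeBytes(bytes[0], bytes[1], bytes[2])); // prints 35830
    }
}
```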

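import java.nio.charset.StandardCharsets;

// Print each UTF-8 byte of the character as an unsigned value, in decimal and binary.
byte[] bytes = "诶".getBytes(StandardCharsets.UTF_8);
for (byte b : bytes) {
  System.out.println((0xFF & b) + " " + Integer.toString(0xFF & b, 2));
}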

Ideone demo

Output:

232 11101000
175 10101111
182 10110110

So the 3 bytes follow the pattern above.
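If the goal is to get the character back rather than its raw bytes, one option is to wrap the byte stream in an InputStreamReader with an explicit charset. The sketch below is an illustration, not part of the original answer: it assumes the terminal sends UTF-8, uses a ByteArrayInputStream as a stand-in for System.in, and the names DecodeStdinSketch and readFirstChar are hypothetical. It also shows that the two bytes the question expected are what the UTF-16 (big-endian) encoding produces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DecodeStdinSketch {

    // Reads the first character from a byte stream, decoding it as UTF-8.
    // Returns -1 at end of stream, like Reader.read().
    static int readFirstChar(InputStream in) {
        try {
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for System.in: the bytes 232 175 182 followed by a newline (10).
        byte[] typed = { (byte) 232, (byte) 175, (byte) 182, 10 };
        System.out.println(readFirstChar(new ByteArrayInputStream(typed))); // prints 35830

        // The same character encoded as UTF-16BE: 10001011 and 11110110.
        for (byte b : "\u8BF6".getBytes(StandardCharsets.UTF_16BE)) {
            System.out.println(Integer.toBinaryString(0xFF & b));
        }
    }
}
```

With real input, passing System.in instead of the byte array should behave the same way, provided the terminal really uses UTF-8.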

java

system.in

character-encoding