Different code points for the same character in macOS and Windows

Question

I have a small piece of code in which I check the code point of the character Ü.

Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));

When I run this code on Mac OS X and on Windows 10, I get different code point values; see the output below.

Output on Mac OS X

en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220

Output on Windows

en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195

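I checked the windows-1252 code page at https://en.wikipedia.org/wiki/Windows-1252#Character_set, and there the code point for Ü is 220. For String glyph = "Ü";, why do I get a code point of 195 on Windows? As I understand it, glyph should have been rendered correctly and its code point should be 220, since the character is defined in Windows-1252.

If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8"));, then glyph is rendered correctly and the code point value is 220. Is this a correct and efficient way to standardize the behavior of a String on any OS, regardless of locale and charset?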

Answer 1

195 is 0xC3 in hexadecimal.

In UTF-8, Ü is encoded as the two bytes 0xC3 0x9C.
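To check the byte values quoted above, here is a minimal sketch that prints the UTF-8 encoding of U+00DC. The class name Utf8BytesDemo and the helper utf8Hex are made up for this illustration, and the character is written as a Unicode escape so the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Formats the UTF-8 encoding of a string as space-separated hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so the signed byte prints as an unsigned value.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00dc")); // 0xC3 0x9C
        System.out.println(0xC3);              // 195
    }
}
```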

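System.getProperty("file.encoding") shows that the default file encoding on Windows is not UTF-8, but your Java file is evidently encoded in UTF-8. The fact that println() prints glyph ?? (note the two ?, meaning two chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.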

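glyph should contain a single char with the value 0x00DC, not two chars. codePointAt(0) returns 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being read as if it were encoded in windows-1252, so the two bytes 0xC3 0x9C are decoded as two separate characters, 0x00C3 and 0x0153 (windows-1252 maps byte 0x9C to U+0153), rather than as the single character 0x00DC.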
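The mis-decoding described above can be reproduced on any platform. This is only a sketch: the class MisdecodeDemo and the method misdecode are hypothetical names, and it assumes the JRE ships the windows-1252 charset (it is not one of the charsets every Java platform is required to support, though common JDKs include it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with a different
    // charset, imitating a UTF-8 source file that is read as windows-1252.
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrongCharset);
    }

    public static void main(String[] args) {
        // Assumption: this JRE provides the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");
        String wrong = misdecode("\u00dc", cp1252);
        System.out.println(wrong.length());       // 2
        System.out.println(wrong.codePointAt(0)); // 195
        System.out.println(wrong.codePointAt(1)); // 339 (U+0153)
    }
}
```

One char became two, and the first one is the 195 seen in the Windows output.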

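You need to specify the actual file encoding when running Java (and, because string literals are decoded at compile time, also when compiling, via javac -encoding UTF-8), e.g.: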

java -Dfile.encoding=UTF-8 ...
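For comparison, the new String("Ü".getBytes(), ...) workaround from the question appears to work only because the platform default charset happens to match the charset the source was mis-read with. The sketch below makes that charset explicit; RoundTripDemo and reinterpretAsUtf8 are illustrative names, and windows-1252 is again assumed to be available in the JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Re-encodes the text with the charset the source was read as, then
    // decodes the resulting bytes as UTF-8.
    static String reinterpretAsUtf8(String text, Charset readAs) {
        return new String(text.getBytes(readAs), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: this JRE provides the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");
        String misdecoded = "\u00c3\u0153"; // UTF-8 bytes of U+00DC read as windows-1252
        String correct = "\u00dc";

        // Recovers U+00DC because the two charsets line up.
        System.out.println(reinterpretAsUtf8(misdecoded, cp1252).codePointAt(0)); // 220
        // Leaves an already-correct string alone when the charset is UTF-8.
        System.out.println(reinterpretAsUtf8(correct, StandardCharsets.UTF_8).codePointAt(0)); // 220
        // Damages an already-correct string when the charsets do not line up.
        System.out.println(reinterpretAsUtf8(correct, cp1252).equals(correct)); // false
    }
}
```

The last case suggests the round trip is fragile as a general normalization step, which is why declaring the real source encoding, as above, is the more dependable fix.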

Tags: java, string, unicode, utf-8, windows-1252