How does javac process Unicode glyphs?
I tried System.out.println("ñ"); and it printed ñ. Why did javac compile this without throwing an error?
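javac can be told which encoding the source file uses (the -encoding option). With that in place, you can use non-ASCII characters in character and string literals (and in symbol names!).
If the configured encoding actually matches the file's encoding, everything works.
If it does not, you may get a compile error, but more likely you will just end up with corrupted strings.
To print the text back out, the program also needs to know which encoding to use when printing. All of this has to be configured correctly (the defaults in Java are not portable), otherwise you will get all kinds of broken text output.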
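As a rough illustration of the printing side, the sketch below encodes the same one-character string with two different charsets and shows that the resulting bytes differ. The class name EncodingDemo and the helper printedBytes are made up for this example; the source uses an escape instead of a literal ñ so that it compiles the same way whatever encoding javac assumes (otherwise you would compile with something like javac -encoding UTF-8 EncodingDemo.java).

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: returns the bytes a writer configured with
    // the given charset produces when it prints the text.
    static byte[] printedBytes(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(buffer, charset));
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00F1"; // n with tilde (U+00F1), written as an escape

        // The same character becomes different bytes depending on the charset.
        System.out.println(printedBytes(text, StandardCharsets.UTF_8).length);      // 2
        System.out.println(printedBytes(text, StandardCharsets.ISO_8859_1).length); // 1

        // Printing with an explicit charset instead of the platform default.
        // Whether the glyph shows up correctly still depends on the terminal.
        PrintWriter stdout = new PrintWriter(
                new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
        stdout.println(text);
    }
}
```

If the terminal expects a different encoding than the one the program writes, the two bytes of the UTF-8 form are what typically show up as garbage.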
Java char and String are natively UTF-16, so they can handle 'ñ' and "ñ".
JLS-3.1. Unicode says (in part),
The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.
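A small sketch of what "16-bit code units" means in practice, assuming a made-up class Utf16Demo: ñ (U+00F1) fits in a single char, while a character outside the Basic Multilingual Plane, here U+1F600, takes two chars (a surrogate pair) but still counts as one code point.

```java
public class Utf16Demo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u00F1";                 // n with tilde, U+00F1
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        System.out.println(codeUnits(bmp));            // 1
        System.out.println(codePoints(bmp));           // 1
        System.out.println((int) bmp.charAt(0));       // 241, i.e. 0xF1

        System.out.println(codeUnits(supplementary));  // 2
        System.out.println(codePoints(supplementary)); // 1
    }
}
```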
JLS-3.2. Lexical Translations expands on that,
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).
3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).
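Because the escapes are translated before the source is split into tokens, they work anywhere in the file, not only inside string literals. The sketch below (the class EscapeDemo and its members are invented for illustration) is written in pure ASCII: one escape produces the ñ character, and another one spells the first letter of an identifier.

```java
public class EscapeDemo {
    // The escape here stands for the letter 'a', so this declares
    // a field whose name is "answer".
    static int \u0061nswer = 42;

    // Compares the escaped literal with the same character built from
    // its numeric value, U+00F1.
    static boolean sameString() {
        return "\u00F1".equals(String.valueOf((char) 0xF1));
    }

    public static void main(String[] args) {
        System.out.println(sameString());      // true
        System.out.println(answer);            // 42
        System.out.println("\u00F1".length()); // 1
    }
}
```

One consequence worth knowing: since the translation also applies inside comments, a malformed escape in a comment is still a compile error.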