Character encoding confusion on Windows
I have a simple Java program that takes in hex and converts it to ASCII.
Using Java 8, I compiled the following:
import java.nio.charset.Charset;
import java.util.Scanner;

public class Main
{
    public static void main(String[] args)
    {
        System.out.println("Charset: " + Charset.defaultCharset());
        Scanner in = new Scanner(System.in);
        System.out.print("Type a HEX string: ");
        String s = in.nextLine();
        String asciiStr = new String();
        // Split the string into an array
        String[] hexes = s.split(":");
        // For each hex
        for (String hex : hexes) {
            // Translate the hex to ASCII
            System.out.print(" " + Integer.parseInt(hex, 16) + "|" + (char)Integer.parseInt(hex, 16));
            asciiStr += ((char) Integer.parseInt(hex, 16));
        }
        System.out.println("\nthe ASCII string is " + asciiStr);
        in.close();
    }
}
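I am passing the hex string C0:A8:96:FE to the program. My main concern is the 0x96 value, because it is defined as a control character (characters in the range of 128 - 159).
The output when I run the program without any JVM flags is the following: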
Charset: windows-1252
Type a HEX string: C0:A8:96:FE
192|À 168|¨ 150|? 254|þ
the ASCII string is À¨?þ
The output when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding is the following:
Charset: ISO-8859-1
Type a HEX string: C0:A8:96:FE
192|À 168|¨ 150|– 254|þ
the ASCII string is À¨–þ
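I'm wondering why, when the character encoding is set to ISO-8859-1, I get the extra Windows-1252 characters for characters 128 - 159. These characters should not be defined in ISO-8859-1 but should be defined in Windows-1252, yet it appears to be backwards here. In ISO-8859-1, I would expect the 0x96 character to be encoded as a blank character, but that is not the case. Instead, the Windows-1252 encoding does this, when it should properly encode it as –. Any help?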
tl;dr
My guess: While the default Charset of your JVM may be "windows-1252", your System.out is actually using Unicode.
You said:
when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding
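My experiment below leads me to suspect that nothing you are doing actually affects the character set used by System.out. I believe that in both of your runs, when you thought your System.out was using "windows-1252" or "ISO-8859-1", it was in fact using Unicode, likely UTF-8.
I wish I knew how to get the Charset of System.out.
This behavior may change in the future: there is a proposal (JEP 400) to use UTF-8 by default across platforms.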
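On the wish above, here is a minimal sketch of one way to ask a PrintStream for its charset. It assumes a newer JDK: to my knowledge PrintStream gained a charset() method in Java 18, so the sketch looks it up reflectively and falls back to "unknown" on older runtimes. The class name StdoutCharsetProbe and the helper charsetNameOf are made up for illustration, and the stdout.encoding property is, as far as I know, only set on recent JDKs (it prints null elsewhere).

```java
import java.io.PrintStream;
import java.lang.reflect.Method;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original question or answer.
final class StdoutCharsetProbe {

    // Returns the charset name of the given PrintStream, or "unknown" when the
    // running JDK has no PrintStream.charset() method (assumed: added in Java 18).
    static String charsetNameOf(PrintStream stream) {
        try {
            Method charsetMethod = PrintStream.class.getMethod("charset");
            Charset charset = (Charset) charsetMethod.invoke(stream);
            return charset.name();
        } catch (ReflectiveOperationException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        // The JVM default charset and the charset of System.out can differ.
        System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset());
        System.out.println("file.encoding            = " + System.getProperty("file.encoding"));
        System.out.println("stdout.encoding          = " + System.getProperty("stdout.encoding"));
        System.out.println("System.out charset       = " + charsetNameOf(System.out));
    }
}
```

Running this once with and once without -Dfile.encoding=ISO-8859-1 would show whether the flag changes anything about System.out on a given JDK.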
Details
Actually, you are asking about Unicode rather than ASCII. ASCII has only 128 characters.
You said:
My main concern is the 0x96 value, because it is defined as a control character (characters in the range of 128 - 159).
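Actually, that range of control characters starts at 127 in Unicode (and ASCII), not 128. Code point 127 is the DELETE character. So 127-159 are control characters, in addition to the lower range 0-31.
First, let's split your input string of hex codes.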
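The control-character ranges can be checked directly with Character.isISOControl, which reports true for code points 0-31 and 127-159. A small sketch follows; the class name ControlCharCheck and its helper are made up for illustration.

```java
// Hypothetical helper class for illustration only.
final class ControlCharCheck {

    // True when the code point is an ISO control character:
    // U+0000..U+001F (C0), U+007F (DELETE) or U+0080..U+009F (C1).
    static boolean isControl(int codePoint) {
        return Character.isISOControl(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x1F, 0x7E, 0x7F, 0x80, 0x96, 0x9F, 0xA0, 0xC0};
        for (int codePoint : samples) {
            System.out.printf("U+%04X control=%b name=%s%n",
                    codePoint, isControl(codePoint), Character.getName(codePoint));
        }
    }
}
```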
final List < String > hexInputs = List.of( "C0:A8:96:FE".split( ":" ) );
System.out.println( "hexInputs = " + hexInputs );
When run.
hexInputs = [C0, A8, 96, FE]
Now parse each hexadecimal string into an integer. We use that integer as a Unicode code point.
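Since the question mentions Java 8, here is a sketch of the same hex-to-code-point step that avoids Character.toString(int), which as far as I know only exists from Java 11 onward. The class HexToCodePoint and its methods are hypothetical names; StringBuilder.appendCodePoint is used so that values above U+FFFF would also work, which a plain (char) cast would not handle.

```java
// Hypothetical helper class for illustration only.
final class HexToCodePoint {

    // Parses one hexadecimal token such as "C0" into a Unicode code point.
    static int parse(String hex) {
        int codePoint = Integer.parseInt(hex.trim(), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + hex);
        }
        return codePoint;
    }

    // Converts a colon-separated list such as "C0:A8:96:FE" into a String.
    static String toText(String colonSeparatedHex) {
        StringBuilder text = new StringBuilder();
        for (String hex : colonSeparatedHex.split(":")) {
            text.appendCodePoint(parse(hex));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String text = toText("C0:A8:96:FE");
        // Print the code points numerically so the result does not depend on
        // how the console renders the characters.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    }
}
```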
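Rather than rely on some default character encoding, let's explicitly set the Charset of System.out. I am no expert on this, but some web searching turned up the code below, where we wrap System.out in a new PrintStream while setting a Charset by its name. I could not find a way to get the Charset of a PrintStream, so I asked.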
UTF-8
// UTF-8
System.out.println( "----------| UTF-8 |--------------------------" );
try
{
    PrintStream printStream = new PrintStream( System.out , true , StandardCharsets.UTF_8.name() ); // "UTF-8".
    for ( String hexInput : hexInputs )
    {
        int codePoint = Integer.parseInt( hexInput , 16 );
        String string = Character.toString( codePoint );
        printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
    }
}
catch ( UnsupportedEncodingException e )
{
    e.printStackTrace();
}
When run.
----------| UTF-8 |--------------------------
hexInput: C0 = codePoint: 192 = string: [À] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [¨] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [þ] = isLetter: true = name: LATIN SMALL LETTER THORN
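What a console displays depends on how it interprets bytes, so it can help to look at the bytes a wrapped PrintStream actually emits. The sketch below is my own addition, not the code from the web search: it wraps a ByteArrayOutputStream instead of System.out and uses the PrintStream constructor that takes a Charset object (Java 10 or later, to my knowledge), which avoids the UnsupportedEncodingException. PrintStreamBytes and hexBytes are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class PrintStreamBytes {

    // Prints the text through a PrintStream that uses the given charset and
    // returns the bytes that reached the underlying stream, as spaced hex.
    static String hexBytes(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charset)) {
            out.print(text);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String text = "\u00C0\u00A8\u0096\u00FE"; // code points C0, A8, 96, FE
        System.out.println("UTF-8        : " + hexBytes(text, StandardCharsets.UTF_8));      // C3 80 C2 A8 C2 96 C3 BE
        System.out.println("ISO-8859-1   : " + hexBytes(text, StandardCharsets.ISO_8859_1)); // C0 A8 96 FE
        System.out.println("windows-1252 : " + hexBytes(text, Charset.forName("windows-1252"))); // C0 A8 3F FE
    }
}
```

Note the 3F in the windows-1252 row: code point U+0096 has no mapping in that charset, so the encoder substitutes an ordinary question mark.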
Windows-1252
Next, we do the same but set "windows-1252" as the Charset of our wrapped System.out. Before doing the wrapping, we verify that such a character encoding is actually available on our current JVM.
// windows-1252
System.out.println( "----------| windows-1252 |--------------------------" );
// Verify windows-1252 charset is available on the current JVM.
String windows1252CharSetName = "windows-1252";
boolean isWindows1252CharsetAvailable = Charset.availableCharsets().keySet().contains( windows1252CharSetName );
if ( isWindows1252CharsetAvailable )
{
    System.out.println( "isWindows1252CharsetAvailable = " + isWindows1252CharsetAvailable );
} else
{
    System.out.println( "FAIL - No charset available for name: " + windows1252CharSetName );
}
try
{
    PrintStream printStream = new PrintStream( System.out , true , windows1252CharSetName );
    for ( String hexInput : hexInputs )
    {
        int codePoint = Integer.parseInt( hexInput , 16 );
        String string = Character.toString( codePoint );
        printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
    }
}
catch ( UnsupportedEncodingException e )
{
    e.printStackTrace();
}
When run.
----------| windows-1252 |--------------------------
isWindows1252CharsetAvailable = true
hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [?] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN
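As a side note on the availability check used above: Charset.availableCharsets() is keyed by canonical names only, so an alias such as "cp1252" or "latin1" would not be found that way. A possible alternative is Charset.isSupported, which also accepts aliases. The sketch below uses a hypothetical class CharsetLookup and assumes the JVM ships windows-1252, which the run above suggests is the case.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration only.
final class CharsetLookup {

    // Returns the canonical name for a charset name or alias,
    // or null when the current JVM does not support it.
    static String canonicalNameOrNull(String nameOrAlias) {
        return Charset.isSupported(nameOrAlias) ? Charset.forName(nameOrAlias).name() : null;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"windows-1252", "cp1252", "latin1", "no-such-charset"}) {
            System.out.println(name + " -> " + canonicalNameOrNull(name));
        }
    }
}
```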
Latin-1
We can try Latin-1 too, which gives yet another result.
// ISO-8859-1
System.out.println( "----------| Latin-1 |--------------------------" );
// Verify that charset is available on the current JVM.
String latin1CharsetName = "ISO-8859-1"; // Also known as "Latin-1".
boolean isLatin1CharsetNameAvailable = Charset.availableCharsets().keySet().contains( latin1CharsetName );
if ( isLatin1CharsetNameAvailable )
{
    System.out.println( "isLatin1CharsetNameAvailable = " + isLatin1CharsetNameAvailable );
} else
{
    System.out.println( "FAIL - No charset available for name: " + latin1CharsetName );
}
try
{
    PrintStream printStream = new PrintStream( System.out , true , latin1CharsetName );
    for ( String hexInput : hexInputs )
    {
        int codePoint = Integer.parseInt( hexInput , 16 );
        String string = Character.toString( codePoint );
        printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
    }
}
catch ( UnsupportedEncodingException e )
{
    e.printStackTrace();
}
When run.
----------| Latin-1 |--------------------------
isLatin1CharsetNameAvailable = true
hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [�] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN
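Conclusion
So you can see that when hard-coding the Charset of our wrapped System.out, we do indeed see a difference. With UTF-8, we get actual characters [À], [¨], [], [þ], whereas with windows-1252 we get three funky question-mark characters and one regular question mark [�], [�], [?], [�]. Remember that we added the square brackets in our code.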
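This behavior of my code matches my expectations, and apparently yours as well. Two of those four hex/decimal integers are letters in Unicode, yet none of them rendered as letters under windows-1252 or Latin-1 in my runs. The only mystery to me is that the hex 96 / decimal 150 value has three different representations: an empty space with UTF-8, a regular question mark with windows-1252, and a funky question mark under Latin-1.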
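One way to probe that mystery is to ask each charset what it does with the value 0x96 in both directions: which code point the single byte 0x96 decodes to, and whether the control character U+0096 can be encoded at all. This is only a sketch with hypothetical names (Byte96Probe, decodeSingleByte, canEncode), and it does not settle which charset System.out uses. It does show that windows-1252 and ISO-8859-1 treat 0x96 differently, so a console that interprets incoming bytes as Windows-1252 (an assumption about the asker's setup) would draw byte 0x96 as an en dash.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class Byte96Probe {

    // Decodes a single byte with the given charset and returns the code point.
    static int decodeSingleByte(int byteValue, Charset charset) {
        return new String(new byte[] {(byte) byteValue}, charset).codePointAt(0);
    }

    // Reports whether the charset has a byte sequence for the code point.
    static boolean canEncode(int codePoint, Charset charset) {
        return charset.newEncoder().canEncode(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Byte 0x96 is EN DASH (U+2013) in windows-1252 but the control
        // character U+0096 in ISO-8859-1.
        System.out.printf("0x96 as windows-1252 -> U+%04X%n", decodeSingleByte(0x96, windows1252));
        System.out.printf("0x96 as ISO-8859-1   -> U+%04X%n", decodeSingleByte(0x96, latin1));

        // U+0096 cannot be encoded in windows-1252, so a PrintStream using
        // that charset falls back to '?'. ISO-8859-1 writes it as byte 0x96.
        System.out.println("windows-1252 can encode U+0096: " + canEncode(0x96, windows1252));
        System.out.println("ISO-8859-1   can encode U+0096: " + canEncode(0x96, latin1));
    }
}
```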
Conclusion: Your System.out is not using the Charset that you think it is using. I suspect that while the default Charset of your JVM may be named "windows-1252", your System.out is actually using the Unicode character set, likely with UTF-8 encoding.
Note to the reader: If you are not familiar with character sets and character encodings, I recommend the fun and easy-to-read post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).