When encoding a string as UTF-8 with a 3-byte character, why does the length only increase by 1?
I'm trying to make sure I understand encoding correctly, so I wrote a sample test:
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class TestEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        TestEncoding testEncoding = new TestEncoding();
        testEncoding.isLengthDifferenceBetweenUTF16UTF32();
    }

    private void isLengthDifferenceBetweenUTF16UTF32() throws UnsupportedEncodingException {
        // Each line encodes to bytes, then decodes again with the JVM default charset.
        String eightBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8));
        String sixteenBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_16LE));
        String thirtyTwoBitString = new String("Hi how are you?ࢤ".getBytes("UTF-32"));
        System.err.println("8 bit: " + eightBitString.length());
        System.err.println("16 bit: " + sixteenBitString.length());
        System.err.println("32 bit: " + thirtyTwoBitString.length());
    }
}
Then for the output I get:
8 bit: 16
16 bit: 32
32 bit: 64
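My question is why the 8-bit length is 16 and not 18: Hi how are you? is 15 characters, and the special character ࢤ at the end takes 3 bytes in UTF-8, which should give a total of 18.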
String eightBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8));
String sixteenBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_16LE));
String thirtyTwoBitString = new String("Hi how are you?ࢤ".getBytes("UTF-32"));
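These lines take the string, convert it to bytes in the specified charset, and then convert those bytes back into a string using the JVM's default charset. The length() you print afterwards counts the chars (UTF-16 code units) of that decoded string, not the bytes, so a character that takes 3 bytes in UTF-8 still adds only 1 to the length.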
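What you get back depends on the default charset, and the byte sequence may be invalid in that charset. In that case, the string will contain a placeholder character for each invalid sequence. It looks like � in the resulting string, and it counts as one character.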
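To make the placeholder behaviour concrete, here is a minimal sketch that passes UTF-8 explicitly as the decoding charset, standing in for a JVM whose default is UTF-8. The class name MismatchedDecodeSketch and the helper decodeAsUtf8 are made up for illustration, and ࢤ is written as the escape \u08A4 so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchedDecodeSketch {
    // Same text as in the question; \u08A4 is the character ࢤ.
    static final String TEXT = "Hi how are you?\u08A4";

    // Encode TEXT with the given charset, then decode the bytes as UTF-8.
    // Decoding as UTF-8 is an assumption that mimics a UTF-8 default charset.
    static String decodeAsUtf8(Charset encodeWith) {
        byte[] bytes = TEXT.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromUtf8 = decodeAsUtf8(StandardCharsets.UTF_8);
        String fromUtf16 = decodeAsUtf8(StandardCharsets.UTF_16LE);

        // Matching charsets round-trip cleanly: 16 chars.
        System.out.println(fromUtf8.length());
        // Mismatched charsets: every byte becomes its own char, 32 in total.
        System.out.println(fromUtf16.length());
        // The invalid byte 0xA4 was replaced by the placeholder U+FFFD.
        System.out.println(fromUtf16.indexOf('\uFFFD') >= 0);
    }
}
```

In the mismatched case, the zero bytes that UTF-16LE emits for each ASCII character decode to invisible NUL characters, which is why the printed string still looks like the original text while its length doubles.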
For example, if the default charset is UTF-8, the 3 strings are:
Hi how are you?ࢤ
Hi how are you?�
Hi how are you?�
If you want to compare the lengths of the byte representations in these charsets, don't convert back to a string:
byte[] eightBit = "Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8);
System.out.println(eightBit.length);
And so on.
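Spelled out for all three charsets, the comparison might look like the sketch below. The class name ByteLengthSketch and the helper encodedLength are hypothetical, and looking up "UTF-32" by name assumes the JDK in use provides that charset, as the question's code already does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLengthSketch {
    // Same text as in the question; \u08A4 is the character ࢤ.
    static final String TEXT = "Hi how are you?\u08A4";

    // Number of bytes TEXT occupies when encoded with the given charset.
    static int encodedLength(Charset charset) {
        return TEXT.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // length() counts chars (UTF-16 code units), not bytes.
        System.out.println("chars:    " + TEXT.length());
        System.out.println("UTF-8:    " + encodedLength(StandardCharsets.UTF_8));
        System.out.println("UTF-16LE: " + encodedLength(StandardCharsets.UTF_16LE));
        // UTF-32 is not in StandardCharsets, so it is looked up by name.
        System.out.println("UTF-32:   " + encodedLength(Charset.forName("UTF-32")));
    }
}
```

This should print 16 chars, 18 bytes for UTF-8 (15 one-byte characters plus 3 bytes for ࢤ), 32 bytes for UTF-16LE, and 64 bytes for UTF-32, which matches the byte counts implied by the output in the question.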