Transform a UTF-8 string to UCS-2, replacing invalid characters, in Java

Question

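I have a string in UTF-8:

"Red🌹🌹Röses"

I need to convert it to valid UCS-2 (or fixed-size UTF-16BE without a BOM, they are the same thing), so that the output is "Red Röses", because "🌹" is outside the UCS-2 range.

What I have tried: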

@Test
public void testEncodeProblem() throws CharacterCodingException {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    ByteBuffer input = ByteBuffer.wrap(in.getBytes());

    CharsetDecoder utf8Decoder = StandardCharsets.UTF_16BE.newDecoder();
    utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
    utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    utf8Decoder.replaceWith(" ");

    CharBuffer decoded = utf8Decoder.decode(input);

    System.out.println(decoded.toString()); //  剥擰龌맰龌륒쎶獥 
}

Nope.

@Test
public void testEncodeProblem() {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    byte[] bytes = in.getBytes(StandardCharsets.UTF_16BE);
    String res = new String(bytes);
    System.out.println(res); //  Red�<�9�<�9Röses
}

Nope.

Please note that "ö" is a valid UCS-2 symbol.

Any ideas or libraries?

Answer 1

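Unfortunately, neither snippet actually works, and that is because you have misunderstood the UTF-16 encoding. UTF-16 CAN encode those emoji; it is not fixed width. There is no such thing as a 'fixed-width UTF-16 encoding'. There is UCS2, and that is not UTF-16. The BE part does not make it 'fixed width', it just locks in the byte order. That is why both of these print the roses. Java unfortunately does not ship with a UCS2 encoding, which makes this job harder and uglier.

Furthermore, both snippets fail because you are calling forbidden methods.

Any time you convert bytes to characters or vice versa, character conversion IS happening. You cannot opt out of it. Nevertheless, a bunch of methods exist that take no parameter to indicate which charset you want to use. These are the forbidden methods: they default to the 'system default', and look as if someone waved a magic wand so that we can convert chars to bytes and back without worrying about character encoding.

The solution is to never use the forbidden methods. Better yet, tell your IDE to flag them as errors. The only exception is when you know that an API defaults not to the 'platform default' but to something sane. The only one I know of is the Files.* API, which defaults to UTF-8 instead of the platform default, so using the charset-less variants is acceptable there.

If you truly must have the platform default (which only makes sense for command-line tools), make it explicit by passing Charset.defaultCharset().

The list of forbidden methods is long, but new String(bytes) and string.getBytes() are both on it. Do not use these methods/constructors. Ever.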
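As a minimal sketch of the explicit-charset rule: the class and helper names below (`ExplicitCharsetDemo`, `toBytes`, `fromBytes`) are made up for illustration, and only `String.getBytes(Charset)`, `new String(byte[], Charset)` and `Charset.defaultCharset()` come from the JDK. On JDK 18 and later the default charset is UTF-8 unless overridden (JEP 400), but code that may run on older runtimes cannot rely on that.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Encode with a charset chosen by the caller, never the implicit default.
    static byte[] toBytes(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // Decode with an explicitly named charset as well.
    static String fromBytes(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String in = "R\u00F6ses"; // "Röses", escaped so the source file encoding does not matter
        byte[] utf8 = toBytes(in, StandardCharsets.UTF_8);
        byte[] utf16 = toBytes(in, StandardCharsets.UTF_16BE);
        System.out.println(utf8.length);  // 6: 'ö' takes two bytes in UTF-8
        System.out.println(utf16.length); // 10: two bytes per char in UTF-16BE
        System.out.println(fromBytes(utf8, StandardCharsets.UTF_8).equals(in)); // true

        // If the platform default really is what you want, spell it out:
        byte[] platform = toBytes(in, Charset.defaultCharset());
        System.out.println(platform.length);
    }
}
```

The last call behaves like the charset-less `in.getBytes()`, but a reader can see that the platform default was a deliberate choice.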

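Furthermore, your first snippet is confused. You want to ENCODE a string to UTF-16, not decode it. (A string is already characters and has no encoding. It is what it is. So why are you making a decoder when there is nothing to decode?)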

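import java.nio.*;
import java.nio.charset.*;

String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
CharBuffer input = CharBuffer.wrap(in);
CharsetEncoder utf16Encoder = StandardCharsets.UTF_16BE.newEncoder();
utf16Encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
// An encoder's replacement is a byte sequence, not a String.
utf16Encoder.replaceWith(" ".getBytes(StandardCharsets.UTF_16BE));
ByteBuffer encoded = utf16Encoder.encode(input); // throws CharacterCodingException

// Decode only the bytes actually written; array() may be larger than that.
System.out.println(StandardCharsets.UTF_16BE.decode(encoded));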

Or the second snippet:

@Test
public void testEncodeProblem() {
    String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
    byte[] bytes = in.getBytes(StandardCharsets.UTF_16BE);
    String res = new String(bytes, StandardCharsets.UTF_16BE);
    System.out.println(res);
}

However, as I said, both just print the roses, because those are representable in UTF-16.
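One way to see why the REPLACE action never fires is to ask the encoder directly. This is only a sketch: the class and method names are invented for the example, while `CharsetEncoder.canEncode(CharSequence)` is the JDK call doing the work.

```java
import java.nio.charset.StandardCharsets;

class CanEncodeDemo {
    // True when UTF-16BE has a mapping for every character in s.
    static boolean utf16Can(String s) {
        return StandardCharsets.UTF_16BE.newEncoder().canEncode(s);
    }

    // Same question for US-ASCII, as a contrast.
    static boolean asciiCan(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String rose = "\uD83C\uDF39"; // U+1F339, the rose from the question
        System.out.println(utf16Can(rose)); // true: nothing is unmappable, so nothing is replaced
        System.out.println(asciiCan(rose)); // false: here a replacement would kick in
        System.out.println(rose.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```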

So, how do you get the job done? If Java had a UCS2 encoding built in, it would be as simple as replacing StandardCharsets.UTF_16BE with StandardCharsets.UCS2, but no such luck. So, I guess, probably 'by hand':

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

String in = "Red\uD83C\uDF39\uD83C\uDF39Röses";
ByteArrayOutputStream out = new ByteArrayOutputStream();
in.codePoints()
    .filter(a -> a < 65536)
    .forEach(a -> {
       out.write(a >> 8);
       out.write(a);
    });

// stream is ugly, but, because codePoints() was added in a time
// when oracle had just invented the shiny hammer, they are using it
// here for smearing butter on their sandwich. Silly geese. Oh well.

byte[] result = out.toByteArray();
// Java ships no UCS2 charset to read this back with. UTF-16BE yields the
// same bytes for everything that survived the filter, unless the input held
// an unpaired surrogate, which UTF-16BE rejects as malformed.
// Let's just print the bytes and check them by hand:

for (byte r : result) System.out.print(" " + (r & 0xFF));
System.out.println();
// For the roses string, decoding with UTF-16BE does work, but it
// won't for inputs that contain unpaired surrogates.
System.out.println(new String(result, StandardCharsets.UTF_16BE));

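Yay! Success!

NB: codePointAt would also work and avoids the ugly stream here, but its argument is a 'char index', not a 'code point index', which complicates things; you have to increment by 2 for any surrogate pair.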
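A hedged sketch of that codePointAt variant follows. `Ucs2ByIndex` and `toUcs2` are hypothetical names, and unlike the stream version this one writes a replacement character instead of dropping the code point, which is closer to what the question asked for. It assumes the replacement itself is a BMP character.

```java
import java.io.ByteArrayOutputStream;

class Ucs2ByIndex {
    // Encodes to big-endian UCS-2, writing `replacement` for every code point
    // that does not fit in 16 bits and for any unpaired surrogate.
    static byte[] toUcs2(String in, char replacement) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            // Advance by 1 char for a BMP code point, by 2 for a surrogate pair.
            i += Character.charCount(cp);
            boolean fits = cp < 0x10000 && !Character.isSurrogate((char) cp);
            int unit = fits ? cp : replacement;
            out.write(unit >> 8); // high byte
            out.write(unit);      // low byte; write(int) keeps only the low 8 bits
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] result = toUcs2("Red\uD83C\uDF39\uD83C\uDF39R\u00F6ses", ' ');
        for (byte b : result) System.out.print(" " + (b & 0xFF));
        System.out.println();
    }
}
```

Each rose becomes one space, so the ten code points of the input produce exactly twenty bytes.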

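Some musings on Unicode, UCS2, and UTF-16:

Unicode is a gigantic table that maps any number between 0 and 1,114,111 (0x10FFFF, a little over 20 bits) to a character, control concept, currency symbol, punctuation mark, emoji, box-drawing element, or other character-like concept.

An encoding such as UTF-8 or US_ASCII defines a transformation of some or all of those numbers into a sequence of bytes, such that the bytes can also be decoded back into a sequence of code points. Code points are usually stored in 32 bits, as they do not fit in 16 and no architecture meaningfully handles, say, 24 bits.

To accommodate UCS2/UTF-16, the Unicode spec has no characters from 0xD800 through 0xDFFF. This is intentional, and there never will be any.

This means UCS2 and UTF-16 are more or less the same thing, with one 'trick':

For any Unicode number below 65536 (which fits in 2 bytes), the UTF-16 encoding is just the number itself, straight up, as 2 bytes. D800-DFFF cannot occur, as those code points are intentionally left unassigned.

For anything from 65536 (0x10000) upward, the free block from D800 to DFFF is used to produce so-called surrogate pairs. 0x10000 is subtracted first, and the result is split in two: the first 'character' (a high surrogate, D800-DBFF) and the second 'character' (a low surrogate, DC00-DFFF) each carry 10 bits of data, for 10+10 = 20 bits in total, exactly enough to cover the rest of the table up to 0x10FFFF.

Thus, UTF-16 encodes any Unicode code point as either 2 bytes or 4 bytes.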
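The surrogate arithmetic can be worked through for the rose itself, U+1F339. The sketch below uses a made-up helper, `toSurrogates`, and cross-checks it against the JDK's `Character.toChars`; it assumes the input is a supplementary code point (0x10000 to 0x10FFFF).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SurrogateMath {
    // Splits a supplementary code point into its high and low surrogate.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;               // at most 20 bits remain
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int rose = 0x1F339;
        char[] pair = toSurrogates(rose);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83C DF39
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(rose))); // true
        // In UTF-16BE the rose is just those two units, big-endian: 4 bytes.
        System.out.println(new String(pair).getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```

The result matches the `\uD83C\uDF39` escapes used in the snippets above.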

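UCS-2 as a term has mostly lost its meaning. Originally it meant exactly 2 bytes per 'character', no more and no less. It still means that, but what 'a character' means has been twisted beyond recognition. That rose? It counts as 2 characters. Try it in Java: x.length() returns 2, not 1.

A sane definition of UCS-2 would be: 1 character really means 1 character, each character is represented by 2 bytes, and if you try to store a character that does not fit (one that would need a surrogate pair), it cannot be encoded, so either crash or apply the placeholder for unrepresentable characters instead. Unfortunately that is not (always) what UCS-2 means, which leaves us writing the code for this operation ourselves: toss out any surrogate pairs, or replace them with a placeholder, so that the byte length is exactly 2 * the number of code points.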
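The length claim is easy to check. A small sketch, with illustrative names (`LengthDemo`, `chars`, `codePoints`) wrapped around the JDK's `String.length()` and `String.codePointCount(int, int)`:

```java
class LengthDemo {
    // Number of UTF-16 code units (Java chars) in s.
    static int chars(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String rose = "\uD83C\uDF39";
        System.out.println(chars(rose));      // 2
        System.out.println(codePoints(rose)); // 1
    }
}
```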

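Note that Java's char is quite close to that ideal of UCS2, as it is a 16-bit number, hardcoded in the Java spec. You can simply loop through all chars (as in Java's char) and toss out any c >= 0xD800 && c < 0xE000 together with the surrogate that immediately follows it, which strips out the roses. Both halves of a pair fall in that range, so dropping every char in the range has the same effect.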
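A minimal sketch of that char loop, under the assumption that dropping (rather than replacing) is acceptable; `StripSurrogates` and `stripNonBmp` are names invented here:

```java
import java.nio.charset.StandardCharsets;

class StripSurrogates {
    // Drops every char in the surrogate range 0xD800..0xDFFF. Both halves of a
    // pair fall in that range, so each pair disappears entirely.
    static String stripNonBmp(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= 0xD800 && c < 0xE000) continue;
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = stripNonBmp("Red\uD83C\uDF39\uD83C\uDF39R\u00F6ses");
        System.out.println(out.length()); // 8 chars remain
        // With no surrogates left, UTF-16BE emits exactly 2 bytes per char.
        System.out.println(out.getBytes(StandardCharsets.UTF_16BE).length); // 16
    }
}
```

Once the surrogates are gone, the built-in UTF-16BE charset can stand in for the missing UCS2 one, because every remaining char encodes to exactly 2 bytes.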

Tags: java, unicode, encoding, utf-8, utf-16