TextEncoder / TextDecoder not round tripping

Question

I must be missing something about how TextEncoder and TextDecoder behave. It seems to me that the following code should round-trip, but it doesn't:

new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);

Since I'm only encoding and decoding a string, the char code should come back unchanged, but this returns 65533 instead of 55296. What am I missing?

Answer 1

From some digging, the TextEncoder.encode() method takes an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.

Also, according to MDN:

A USVString is a sequence of Unicode scalar values. This definition differs from that of DOMString or the JavaScript String type in that it always represents a valid sequence suitable for text processing, while the latter can contain surrogate code points.

So my guess is that the String argument you pass to encode() is being converted to a USVString (either implicitly or inside encode()). Based on this page, converting a String to a USVString first converts it to a DOMString and then follows this procedure, which includes replacing all unpaired surrogates with U+FFFD. That is the code point you see, 65533, the "Replacement Character".
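As a rough illustration of that conversion step, here is a minimal sketch that replaces unpaired surrogates with U+FFFD. The helper name toUSVString is hypothetical (it is not a built-in), and the loop only approximates the spec procedure; it is not how any particular browser implements it.

```javascript
// Hypothetical helper: approximates the String -> USVString conversion
// by replacing every unpaired surrogate with U+FFFD.
function toUSVString(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed pair: keep both code units untouched.
        out += str[i] + str[i + 1];
        i++;
        continue;
      }
    }
    // A lone surrogate (high or low) becomes the replacement character.
    out += isHigh || isLow ? "\ufffd" : str[i];
  }
  return out;
}

console.log(toUSVString(String.fromCharCode(55296)).charCodeAt(0)); // 65533
console.log(toUSVString(String.fromCharCode(55296, 57091)).codePointAt(0)); // 66307
```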

I think String.fromCharCode(55296).charCodeAt(0) works because it never has to perform this String -> USVString conversion.
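To make the difference concrete, the following sketch compares reading the code unit directly with sending it through the encoder. It assumes an environment where TextEncoder and TextDecoder are globals (modern browsers, recent Node.js); the variable names are only for illustration.

```javascript
const lone = String.fromCharCode(55296); // lone high surrogate, 0xD800

// No encoding step: the code unit is read back unchanged.
console.log(lone.charCodeAt(0)); // 55296

// Encoding replaces the lone surrogate with U+FFFD,
// whose UTF-8 form is the three bytes EF BF BD.
const bytes = new TextEncoder().encode(lone);
console.log(Array.from(bytes)); // [ 239, 191, 189 ]

// Decoding those bytes yields U+FFFD, not the original surrogate.
console.log(new TextDecoder().decode(bytes).charCodeAt(0)); // 65533
```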

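As for why TextEncoder.encode() was designed this way, I don't know the Unicode details well enough to attempt an explanation, but I suspect it is meant to simplify the implementation, since the only output encoding it supports appears to be UTF-8, in a Uint8Array. My guess is that requiring a USVString argument without surrogates (rather than a native UTF-16 String that may contain them) simplifies the encoding to UTF-8, or perhaps makes some encoding/decoding use cases simpler?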

Answer 2

For those who (like me) aren't sure what a "Unicode surrogate" is:

The problem

The character code 55296 is not a valid character by itself. So this part of the code is already a problem:

String.fromCharCode(55296)

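Since that char code has no valid character on its own, the string holds a lone surrogate. String.fromCharCode itself does not substitute anything, but TextEncoder.encode() replaces the lone surrogate with the replacement character "�", which happens to have code 65533.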

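Codes like 55296 are only valid as the first element of a pair of codes. Pairs of codes are used to represent characters that don't fit in the Unicode Basic Multilingual Plane. (There are a lot of characters outside the Basic Multilingual Plane, so they need two 16-bit numbers to encode them.)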

For example, here is a valid use of code 55296:

console.log(String.fromCharCode(55296, 57091));

It prints the character "𐌃" (U+10303), from the ancient Etruscan alphabet.
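The arithmetic that turns such a pair into one code point can be sketched as follows. The helper name combineSurrogates is hypothetical; the formula is the standard UTF-16 surrogate-pair calculation, and the sketch assumes its inputs really are a high and a low surrogate.

```javascript
// Hypothetical helper: combine a high and a low surrogate into one code point.
// Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
function combineSurrogates(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const codePoint = combineSurrogates(55296, 57091);
console.log(codePoint); // 66307
console.log(codePoint.toString(16)); // "10303"

// The built-in codePointAt performs the same calculation.
console.log(String.fromCharCode(55296, 57091).codePointAt(0) === codePoint); // true
```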

The solution

This code round-trips correctly:

const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).charCodeAt(0));  // Returns 55296

But note: .charCodeAt only returns the first half of the pair. A safer option may be to use String.prototype.codePointAt, which converts the character to a single numeric code point:

const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).codePointAt(0));  // Returns 66307
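If it is unclear whether a given string will survive the trip, one option is to test it directly. The sketch below uses a hypothetical helper, roundTrips, and passes the ignoreBOM option so that a leading U+FEFF is kept by the decoder rather than stripped; whether such a check is worth doing depends on the use case.

```javascript
// Hypothetical helper: true if UTF-8 encode + decode gives back the same string.
function roundTrips(str) {
  const bytes = new TextEncoder().encode(str);
  // ignoreBOM: true keeps a leading U+FEFF in the output instead of dropping it.
  const decoded = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);
  return decoded === str;
}

console.log(roundTrips(String.fromCharCode(55296, 57091))); // true  (valid pair)
console.log(roundTrips(String.fromCharCode(55296))); // false (lone surrogate)
console.log(roundTrips("plain ASCII")); // true
```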
