BOCU-1 for internal encoding of strings

Question

Some languages/platforms, such as Java, JavaScript, Windows, Dotnet, and KDE, use UTF-16. Others prefer UTF-8.

What is the reason that no language/platform uses BOCU-1? What is the rationale for JEP 254 and the JEP 254 equivalent for Dotnet?

Is the reason that BOCU-1 is patented? Are there any technical reasons also?

Edit

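My question is not about Java. By JEP 254 I mean the compact UTF-16 representation described in that proposal. My question is: since BOCU-1 is compact for almost all Unicode strings, why don't languages/platforms use it internally instead of UTF-16 or UTF-8? Doing so would improve cache performance for any string, not only ASCII or Latin-1 strings.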
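
To make the size comparison concrete, here is a minimal Python sketch that measures how many bytes a string occupies in UTF-8 and in UTF-16. The helper name `encoded_sizes` is made up for this illustration. Python's standard library has no BOCU-1 codec, so BOCU-1 itself is not measured here; the sketch only shows the baseline that a more compact encoding would be compared against.

```python
def encoded_sizes(text: str) -> dict:
    """Return how many bytes `text` occupies in UTF-8 and UTF-16 (no BOM)."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" avoids the 2-byte byte order mark that "utf-16" prepends.
        "utf-16": len(text.encode("utf-16-le")),
    }
```

For example, `encoded_sizes("Привет")` reports 12 bytes in both encodings for 6 code points, and `encoded_sizes("こんにちは")` reports 15 bytes in UTF-8 versus 10 in UTF-16, so neither encoding is compact for every script.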

Such usage might also help support non-Latin programming languages in formats such as the Language Server Index Format (LSIF).

Answer 1

What is the reason that no language/platform uses BOCU-1?

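For Stack Overflow, this question is too broad to answer concisely.

However, in the specific case of Java, note that the possibility of Java adopting BOCU-1 was raised in 2002 as an RFE (Request for Enhancement). See JDK-4787935 (str) Reducing the memory footprint for Strings.

Ten years later, that bug was closed with a resolution of "Won't Fix":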

"Although this is a very interesting proposal, it is highly unlikely that BOCU or any other multi-byte encoding for internal use would be adopted. Furthermore, this comes down to a space-time tradeoff with unclear long-term consequences. Given the length of time this proposal has lingered, it seems appropriate to close it as will not fix".

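The closing comment does not spell out the "space-time tradeoff". One cost commonly associated with variable-width encodings is that finding the n-th character requires a scan, whereas a fixed-width layout allows direct indexing. The sketch below illustrates that with UTF-8, since Python has no built-in BOCU-1 codec; the function name `utf8_offset_of` is made up for this example, and this is my reading of the tradeoff rather than something the bug report states. BOCU-1 would be at least as costly here, because it encodes each code point relative to the preceding one.

```python
def utf8_offset_of(data: bytes, index: int) -> int:
    """Return the byte offset of code point number `index` in UTF-8 `data`.

    This takes O(n) time: every byte before the target must be inspected.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; anything else
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)
```

With a fixed-width encoding such as Latin-1 the same lookup is simply `index` (or `2 * index` for UTF-16 code units), with no scan.
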
What is the rationale for JEP 254...?

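JEP 254 has a section titled "Motivation" that explains this; in particular it states that "most String objects contain only Latin-1 characters". But if that doesn't satisfy you, raise a separate question.

First check What topics can I ask about here? to make sure it is on topic for Stack Overflow. Two of the reviewers of JEP 254 (Aleksey Shipilev and Brian Goetz) respond here on SO, so you might get an authoritative answer.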
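
As a rough illustration of the idea behind JEP 254, the sketch below stores a string as one byte per character when every character fits in Latin-1 and falls back to two-byte UTF-16 code units otherwise, tagging the result so a reader knows which layout was used. This is a simplified Python sketch, not the JDK implementation; the names `compact_encode`, `LATIN1`, and `UTF16` are chosen for this example.

```python
from typing import Tuple

# Tags recording which layout the byte array uses (names are illustrative).
LATIN1 = "LATIN1"
UTF16 = "UTF16"


def compact_encode(text: str) -> Tuple[str, bytes]:
    """Pick the narrowest fixed-width layout that can represent `text`."""
    if all(ord(ch) <= 0xFF for ch in text):
        # Every character fits in one byte.
        return LATIN1, text.encode("latin-1")
    # At least one character is outside Latin-1: use 2-byte code units.
    return UTF16, text.encode("utf-16-le")
```

Both layouts stay fixed-width per code unit, so indexing remains constant time. That is the property a BOCU-1 representation would give up.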

What is the rationale for ... JEP 254 equivalent for Dotnet?

Again, raise this as a separate SO question.

Is the reason that BOCU-1 is patented?

That question is specifically off topic here: "Legal questions, including questions about copyright or licensing, are off-topic for Stack Overflow", though Wikipedia notes "BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions".

Are there any technical reasons also?

One very significant non-technical reason is that the HTML5 specification explicitly forbids the use of BOCU-1!...

Avoid these encodings

The HTML5 specification calls out a number of encodings that you should avoid...

Documents must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they... were never intended for Web content and the HTML5 specification forbids browsers from recognising them.

Of course, that raises the question of why HTML5 forbids the use of BOCU-1. The only technical reason I can find is that this Mozilla documentation on HTML's <meta> element states:

Authors must not use CESU-8, UTF-7, BOCU-1 and/or SCSU as cross-site scripting attacks with these encodings have been demonstrated.

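See this GitHub link for more details on the XSS vulnerability with BOCU-1.

Also note that, in line with the HTML5 specification, all major browsers explicitly state that they do not support BOCU-1.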
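
To show the general shape of this class of attack, here is a small Python sketch using UTF-7, another encoding on the forbidden list, because Python ships a `utf-7` codec but no BOCU-1 codec. The filter function `looks_safe_to_ascii_filter` is a deliberately naive, made-up example. I am assuming the BOCU-1 case works analogously (bytes that look harmless to a filter decode to markup), but it is not demonstrated here.

```python
def looks_safe_to_ascii_filter(payload: bytes) -> bool:
    """Naive check that only rejects literal angle brackets."""
    return b"<" not in payload and b">" not in payload


# "+ADw-" and "+AD4-" are the UTF-7 encodings of "<" and ">".
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A consumer that interprets the bytes as UTF-7 recovers the markup
# that the byte-level filter never saw.
decoded = payload.decode("utf-7")
```

Here `looks_safe_to_ascii_filter(payload)` is true, yet `decoded` is `<script>alert(1)</script>`.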


unicode

utf-8

utf-16

character-encoding

non-ascii-characters