Usable Unicode Ranges for Custom Text Processing

Question

I am developing a processor that splits text into tagged chunks:

LOREM IPSUM SED AMED

will be parsed as:

{word:1}LOREM{/word:1}{space:2}
{word:3}IPSUM{/word:3}{space:4}
{word:5}SED{/word:5}{space:6}
{word:7}AMED{/word:7}

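But I don't want to use "{word}" and the like, because it breaks the processor: the marker is itself just another string... I need to tag the text like this: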

\E002\0001 LOREM \E003\0001 \E004\0002
\E002\0003 IPSUM \E003\0004 \E004\0005
\E002\0006 SED   \E003\0006 \E004\0007
\E002\0008 AMED  \E003\0008

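First, \E002 is the element type number, and its last bit marks the end of the element, so element type numbers go up in steps of 2.
Second, \0001 is the index of the element in the stack.
The choice of \E002 in this example is arbitrary.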

But \0001 is also within the Unicode range, which puts me back where I started...

So which Unicode range can I use? \ff0000? Or how else can I solve this?

Thanks!

Answer 1

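The Unicode Consortium thought of this. There are ranges of Unicode code points that are meant never to represent a displayable character, but a meta-code instead: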

Noncharacters are code points that are permanently reserved and will never have characters assigned to them.
...
Tag characters were intended to support a general scheme for the internal tagging of text streams in the absence of other mechanisms, such as markup languages. The use of tag characters for language tagging is deprecated.
(http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf)

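You should be able to use the regular control characters as "private" tags, because these characters should never appear in proper strings. That is the range from U+0000 to U+001F, excluding the tab (U+0009), the common "returns" (U+000A and U+000D), and, to be safe, U+0000 itself (some libraries don't like a null character in the middle of a string).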
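As a minimal sketch of that selection, the following lists the remaining C0 control characters and checks whether any of them already occur in an input string. The names `EXCLUDED`, `CANDIDATE_MARKERS`, and `find_collisions` are hypothetical, chosen only for this illustration.

```python
# Code points the paragraph above advises against: NUL, tab, LF, CR.
EXCLUDED = {0x0000, 0x0009, 0x000A, 0x000D}

# Every other C0 control character (U+0001..U+001F) is a candidate marker.
CANDIDATE_MARKERS = [chr(cp) for cp in range(0x0000, 0x0020) if cp not in EXCLUDED]


def find_collisions(text):
    """Return the set of candidate markers that already occur in `text`."""
    return set(text) & set(CANDIDATE_MARKERS)


if __name__ == "__main__":
    print(len(CANDIDATE_MARKERS))                   # number of candidates
    print(find_collisions("LOREM\tIPSUM SED\n"))    # tab and LF are excluded
    print(find_collisions("LOREM\x01IPSUM"))        # U+0001 collides
```

Running the collision check on real input before committing to a marker set is a cheap safeguard, since some legacy data does contain stray control characters.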

Non-characters
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data.

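You can use U+FFFE and U+FFFF, which are officially defined as noncharacters. (U+FEFF is not one of them: it is ZERO WIDTH NO-BREAK SPACE, used as the byte order mark, so it is an assigned format character that decoders may treat specially.) There are several more official noncharacters, namely U+FDD0–U+FDEF and the last two code points of every plane, and you can be fairly sure they won't appear in regular text strings.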
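For reference, the Unicode Standard defines exactly 66 noncharacters: U+FDD0–U+FDEF plus the last two code points of each of the 17 planes. A small sketch of a membership test follows; `is_noncharacter` is a hypothetical helper name, not a library function.

```python
def is_noncharacter(cp):
    """Return True if code point `cp` is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The contiguous block of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for cp in (0xFFFE, 0xFFFF, 0xFDD0, 0x1FFFF, 0xFEFF):
        print(f"U+{cp:04X}", is_noncharacter(cp))
```

Note that the byte order mark U+FEFF does not pass this test, because it is an assigned character.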

Some other ranges that have a predefined meaning and are highly unlikely to appear in plain-text strings are:

Specials: U+FFF0–U+FFF8
The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for special character definitions.

Annotation Characters: U+FFF9–U+FFFB
An interlinear annotation consists of annotating text that is related to a sequence of annotated characters. For all regular editing and text-processing algorithms, the annotated characters are treated as part of the text stream. The annotating text is also part of the content, but for all or some text processing, it does not form part of the main text stream.

Tag Characters: U+E0000–U+E007F
This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based string tags using characters that can be strictly separated from ordinary text content characters in Unicode.
(all quotations from the chapter as above)

By convention, you could also use U+2028 (line separator) and/or U+2029 (paragraph separator).

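Technically, your use of U+E000–U+F8FF (the "Private Use Area") is acceptable, because these code points only define an unambiguous character in combination with a particular font. However, these codes may pop up if you get your plain text from a source that relied on such a font.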
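One way to guard against such stray Private Use Area characters is to choose markers only from code points that the input does not already contain. The sketch below assumes the whole input is available up front; `pick_free_markers` is a hypothetical name.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # Private Use Area in the BMP


def pick_free_markers(text, count):
    """Return `count` PUA characters that do not occur anywhere in `text`."""
    used = {ord(ch) for ch in text}
    free = [chr(cp) for cp in range(PUA_FIRST, PUA_LAST + 1) if cp not in used]
    if len(free) < count:
        raise ValueError("not enough unused Private Use Area code points")
    return free[:count]


if __name__ == "__main__":
    # U+E000 and U+E002 are already taken by the (font-specific) input.
    markers = pick_free_markers("x\ue000y\ue002", 3)
    print([f"U+{ord(m):04X}" for m in markers])
```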

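As for how to encode this into your strings: it doesn't matter whether the numeric code immediately following your private tag marker is a valid Unicode character. If you see one of your own tag markers, the value right after it is always your own private sequence number.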
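The "marker, then sequence number" rule can be sketched as follows. This is only an illustration under stated assumptions: the markers reuse the question's U+E002 (word start), U+E003 (word end), and U+E004 (space); the sequence number is stored as a single character via `chr(index)`; and the input text itself contains none of the three markers. `encode` and `decode` are hypothetical names.

```python
import re

WORD_OPEN, WORD_CLOSE, SPACE = "\ue002", "\ue003", "\ue004"


def encode(text):
    """Wrap each word and each whitespace run in a marker plus sequence number."""
    out = []
    for index, token in enumerate(re.findall(r"\S+|\s+", text), start=1):
        seq = chr(index)  # the character right after a marker is the index
        if token.isspace():
            out.append(SPACE + seq)
        else:
            out.append(WORD_OPEN + seq + token + WORD_CLOSE + seq)
    return "".join(out)


def decode(marked):
    """Return a list of (kind, index, text) tuples from an encoded string."""
    tokens = []
    i = 0
    while i < len(marked):
        marker, index = marked[i], ord(marked[i + 1])
        i += 2  # skip the marker and its sequence number
        if marker == SPACE:
            tokens.append(("space", index, ""))
        elif marker == WORD_OPEN:
            end = marked.index(WORD_CLOSE, i)
            tokens.append(("word", index, marked[i:end]))
            i = end + 2  # skip the closing marker and its sequence number
        else:
            raise ValueError(f"unexpected character at offset {i - 2}")
    return tokens


if __name__ == "__main__":
    print(decode(encode("LOREM IPSUM SED AMED")))
```

Because the decoder always consumes the character after a marker as a number, that character is never interpreted as text, even when it happens to be U+0001 or another assigned code point.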

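As you can see, there are lots of possibilities. I guess the most important criterion is whether you want to use other functions on these strings. If you create strings that are technically invalid Unicode (for example, because they contain noncharacter values), some external functions may refuse to process them, or may silently remove the offending values. In that case, you need to stick strictly to a system that uses only 'valid' code points.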
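A rough way to judge how an external function might regard a candidate marker is to look at its Unicode general category, which Python exposes through `unicodedata.category`. The sketch below only reports categories; how any particular library reacts to them is not something this code can tell you. `describe` and `LABELS` are hypothetical names.

```python
import unicodedata

# Human-readable labels for the general categories relevant here.
LABELS = {
    "Cc": "control",
    "Cf": "format",
    "Co": "private use",
    "Cn": "unassigned or noncharacter",
    "Zl": "line separator",
    "Zp": "paragraph separator",
}


def describe(ch):
    """Return (category, label) for a single character."""
    category = unicodedata.category(ch)
    return category, LABELS.get(category, "other")


if __name__ == "__main__":
    for ch in ("\x01", "\uffff", "\ue002", "\ufff9", "\u2028", "\U000E0020"):
        print(f"U+{ord(ch):04X}", *describe(ch))
```

Code points reported as `Cn` are the ones most likely to be rejected or stripped by strict tooling, while `Cc`, `Cf`, and `Co` are assigned or explicitly reserved for private use.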

unicode

text

text-processing

unicode-range