How are numbers encoded in Unicode with right-to-left languages? (Normalisation)

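In an information retrieval lecture, some slides discussed lemmatisation and/or normalisation of the tokens used for indexing and querying text documents. One slide mentioned right-to-left languages such as Arabic, but in Unicode these are encoded logically/sequentially in the same way as left-to-right languages and are only displayed right-to-left. That makes sense, but how are numbers encoded (particularly those written with the Western digits 0 to 9)?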

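In English, the characters of the year "1962" are encoded sequentially as 1962, both in Unicode and in most other character encodings. In Arabic, however, it is unclear whether the year "1962" shown on the lecture slide is encoded sequentially as 1962 and then rendered left-to-right (so the document is technically bidirectional), or encoded as 2691 and rendered right-to-left like the surrounding text.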

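This matters for indexing, because the year 1962 in Arabic text and in English text should end up as the same token in the index. Do I need to reverse the characters in the Arabic text to normalise them, or are they encoded in the same order as in English?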

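I think I found the answer under Unicode Bi-Directional Text. Unicode stores text in logical order rather than in the order it is rendered, so the rendering algorithm can still break lines correctly for different paragraph widths (which also explains why right-to-left support sometimes has to be enabled explicitly, since rendering is more complex).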

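According to Wikipedia, Unicode appears to divide characters into four kinds of directionality: strong, weak, neutral, and explicit formatting. Digits are weak, because their directionality is ambiguous. From Wikipedia: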

Unless a directional override is present numbers are always encoded (and entered) big-endian, and the numerals rendered LTR. The weak directionality only applies to the placement of the number in its entirety. (1)
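To check the directionality classes for yourself, here is a minimal sketch using Python's standard `unicodedata.bidirectional`, which returns the bidi class code of a character. The helper name `bidi_category` and the grouping of class codes into the four kinds named above are my own reading of the Unicode Bidirectional Algorithm, not something taken from the lecture or from Wikipedia.

```python
import unicodedata

# Grouping of Unicode bidi class codes into the four kinds described above.
# This grouping is an assumption based on the Unicode Bidirectional Algorithm.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(ch):
    """Return 'strong', 'weak', 'neutral', 'explicit' or 'unknown' for one character."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    if cls in EXPLICIT:
        return "explicit"
    return "unknown"  # e.g. unassigned code points, which report an empty class


samples = {
    "Latin letter a": "a",
    "Arabic letter alef": "\u0627",
    "Western digit 1": "1",
    "Arabic-Indic digit 1": "\u0661",
    "space": " ",
    "right-to-left override": "\u202e",
}
for name, ch in samples.items():
    print(f"{name}: {unicodedata.bidirectional(ch)} -> {bidi_category(ch)}")
```

Both the Western digit and the Arabic-Indic digit come out as weak, while the letters on either side are strong.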

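So, as far as I can tell, the number "1962" should be encoded in the same logical order (1962) within a run of right-to-left characters as it is in a typical left-to-right string.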
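For indexing, that would mean no reversal is needed. The following is a rough sketch, assuming the text contains no directional overrides: it pulls digit runs out of a string in stored order and folds any Unicode decimal digits to ASCII. The function name `number_tokens` and the sample strings are made up for illustration, and folding Arabic-Indic digits is an extra normalisation step I am assuming an index would want, not something the lecture stated.

```python
import re
import unicodedata


def number_tokens(text):
    """Return digit runs in stored (logical) order, folded to ASCII digits."""
    tokens = []
    # For str patterns, \d matches any Unicode decimal digit, not only 0-9.
    for run in re.findall(r"\d+", text):
        # unicodedata.decimal maps e.g. U+0661 (Arabic-Indic one) to 1.
        tokens.append("".join(str(unicodedata.decimal(ch)) for ch in run))
    return tokens


english = "the year 1962"
# Arabic word for "year" followed by Western digits, written with escapes
# so the source itself is not affected by bidirectional display.
arabic = "\u0633\u0646\u0629 1962"
# The same word followed by the Arabic-Indic digits for 1, 9, 6, 2.
arabic_indic = "\u0633\u0646\u0629 \u0661\u0669\u0666\u0662"

for sample in (english, arabic, arabic_indic):
    print(number_tokens(sample))
```

All three samples give the same token, `1962`, with no reversal step: the digits are stored most significant first in every case.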

unicode

encoding

arabic

lemmatization

right-to-left
