Non-reducible grapheme clusters in Unicode

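I think a "user perceived character" (hereafter UPC) iterator would be very useful in a Unicode library. By UPC I mean the sense discussed in Unicode Standard Annex #29: what the user perceives as a character, which may be represented in Unicode either as a single code point or as a grapheme cluster. Since I generally work with Latin-script languages, I always come up with examples like "I want to handle ü as one UPC, regardless of whether the UPC is a grapheme cluster, or a single codepoint".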

Colleagues who are against UPC iterators (or grapheme cluster iterators, take your pick) counter with "You can normalize to NFC, and then use codepoint iteration" and "there is no use case for grapheme cluster iteration".
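To make the colleagues' suggestion concrete, here is a minimal sketch using Python's standard `unicodedata` module. It only illustrates the `ü` case, where a precomposed code point exists; the variable names are made up for the example.

```python
import unicodedata

# "ü" written as a base letter plus a combining mark (two code points):
# U+0075 LATIN SMALL LETTER U + U+0308 COMBINING DIAERESIS
decomposed = "u\u0308"

# NFC composes the pair into the single precomposed code point U+00FC.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))       # 2
print(len(composed))         # 1
print(composed == "\u00fc")  # True
```

For this particular character, code point iteration after NFC does give one item per UPC.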

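I keep thinking of use cases that are Latin-centric and might not translate well to the wider Unicode world. For example, I'm doing terminal output and I want to pad a column to a width of N columns, so I want to know how many UPCs there are in a string...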

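I guess what I want to know is:

Are there meaningful grapheme clusters that cannot be normalized to a single code point? Are there any that are likely to occur among western users? I assume this is the case for Korean or Arabic, but I have to admit to complete ignorance there.
Do any other languages provide UPC/grapheme cluster iteration/operations? Does the Unicode specification have any advice here?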

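It's unclear why UAX #29 doesn't already answer your questions:

There are many such grapheme clusters, even for languages that use only the Latin alphabet, because not every combining mark has a precomposed form with every letter; see, for example, the blanks in this table on Wikipedia. Table 1a in UAX #29 has several non-Latin examples.
That is the purpose of UAX #29: to generalize grapheme cluster operations to all languages supported by Unicode.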
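As an illustration of the first point, here is a small sketch with a Latin-script example that is not taken from the answer itself: "g̃" (g with tilde, used in Guaraní). It assumes, as the comments state, that Unicode has no precomposed code point for this letter, so NFC leaves it as two code points.

```python
import unicodedata

# "g̃": U+0067 LATIN SMALL LETTER G + U+0303 COMBINING TILDE.
# Assumption for this example: there is no precomposed code point for it.
g_tilde = "g\u0303"

# NFC has nothing to compose the pair into, so the length stays at 2.
nfc = unicodedata.normalize("NFC", g_tilde)

print(len(nfc))                          # 2
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+0067', 'U+0303']
```

Code point iteration over the NFC form would report two items here, although a reader sees one character.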

(1) Are there any that are likely to occur among western users?

👍🏻 (thumbs up + light skin tone). Likely to occur: anywhere in the northern hemisphere with an app that has easy access to emoji.
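A quick way to check this, sketched with Python's standard `unicodedata` module (the variable names are only illustrative): the emoji plus its skin tone modifier is two code points, and NFC does not change that.

```python
import unicodedata

# U+1F44D (thumbs up) followed by U+1F3FB (light skin tone modifier).
thumbs_up_light = "\U0001F44D\U0001F3FB"

nfc = unicodedata.normalize("NFC", thumbs_up_light)

print(len(nfc))                # 2
print(nfc == thumbs_up_light)  # True: NFC leaves the sequence unchanged
```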

(2) Do any other languages provide UPC/grapheme cluster iteration/operations?

Rust's unicode_segmentation crate (library).
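To show roughly what such an iterator does without pulling in a segmentation library, here is a deliberately simplified Python sketch. It is not a UAX #29 implementation: the helpers `is_extender` and `approx_clusters` are hypothetical names, and the rules are a crude approximation that ignores Hangul jamo, regional indicator (flag) pairs, CR LF, prepended marks and more. It only glues combining marks, ZWJ sequences and emoji skin tone modifiers onto the preceding code point.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def is_extender(ch):
    """Rough stand-in for 'extends the previous cluster' (an approximation)."""
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks
        or ch == ZWJ
        or "\U0001F3FB" <= ch <= "\U0001F3FF"           # emoji skin tone modifiers
    )

def approx_clusters(text):
    """Split text into approximate user-perceived characters."""
    clusters = []
    for ch in text:
        # Attach extenders, and anything right after a ZWJ, to the open cluster.
        if clusters and (is_extender(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(len(approx_clusters("u\u0308")))                # 1
print(len(approx_clusters("g\u0303a")))               # 2
print(len(approx_clusters("\U0001F44D\U0001F3FB!")))  # 2
```

For real use, a library that tracks the Unicode data files (such as the Rust crate named above) is the safer choice, since the cluster rules change between Unicode versions.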

unicode

text-segmentation