Given a list of Unicode code points, how does one split them into a list of Unicode characters?

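I am writing a lexer for Unicode text. Many Unicode characters require multiple code points, even after canonical composition. For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How do I know where the boundary between two characters is? I would also like to store the text in its original, un-normalized form. My input is guaranteed to be Unicode.

Here is what I have so far: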

from unicodedata import normalize

def split_into_characters(text):
    character = ""
    characters = []

    for i in range(len(text)):
        character += text[i]

        if len(normalize('NFKC', character)) > 1:
            characters.append(character[:-1])
            character = character[-1]

    if len(character) > 0:
        characters.append(character)

    return characters

print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))

This incorrectly prints the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī', '́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī', '́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

I would like it to print the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

The boundaries between perceived characters can be identified with Unicode's Grapheme Cluster Boundary algorithm. Python's unicodedata module doesn't have the necessary data for the algorithm (the Grapheme_Cluster_Break property), but complete implementations can be found in libraries like PyICU and uniseg.
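If pulling in a library is not an option, a rough standard-library approximation may be enough for text like the Latin example in the question. The sketch below is not the UAX #29 algorithm: it simply attaches every code point whose general category is a mark (Mn, Mc, Me) to the code point before it. The function name split_base_plus_marks is made up for illustration, and the approach will get Hangul jamo, emoji ZWJ sequences, regional-indicator flags, and CR LF wrong.

```python
import unicodedata

def split_base_plus_marks(text):
    """Approximate grapheme splitting (NOT full UAX #29).

    Each code point whose general category starts with 'M'
    (Mn, Mc, Me) is appended to the previous cluster; every
    other code point starts a new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # combining mark: extend the current cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters

# U+012B (i with macron) followed by U+0301 (combining acute),
# then a plain 'a' followed by U+0304 (combining macron).
text = "v\u012b\u0301lla\u0304"
clusters = split_base_plus_marks(text)
print(len(text), len(clusters))  # 7 5
```

Because the input is never normalized, joining the clusters gives back the original string exactly, which matters if you want to store the un-normalized text.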

You may want to use the pyuegc library, an implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29:

from pyuegc import EGC  # pip install pyuegc

string = 'Puélla in vī́llā vīcī́nā hábitat.'
egc = EGC(string)
print(egc)
# ['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

print(len(string))
# 35
print(len(egc))
# 31
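
Since the question is about a lexer that keeps the original text, it may help to turn the cluster list into code point offsets so that tokens can refer back to the un-normalized string. This is only a sketch: cluster_spans is a hypothetical helper, not part of pyuegc, and it accepts any list of cluster strings (for example the list produced by EGC above). A hand-written list is used here so the snippet runs without extra packages.

```python
from itertools import accumulate

def cluster_spans(clusters):
    """Return a (start, end) code point offset pair for each cluster.

    The offsets index into the string obtained by concatenating
    the clusters, i.e. the original un-normalized text.
    """
    ends = list(accumulate(len(c) for c in clusters))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

# Hand-written clusters standing in for a segmenter's output:
# 'v', then U+012B + U+0301 as one cluster, then 'l'.
clusters = ["v", "\u012b\u0301", "l"]
original = "".join(clusters)
spans = cluster_spans(clusters)
print(spans)  # [(0, 1), (1, 3), (3, 4)]
```

Slicing original[start:end] with each pair recovers the corresponding cluster unchanged.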