How many Unicode code points make up one Hindi character?

Question

I tried this in Python, using the third-party regex module:

import regex

for m in regex.findall(r"\X", 'ल्लील्ली', regex.UNICODE):
    for i in m:
        print(i, i.encode('unicode-escape'))
    print('--------')

The result shows ल्ली as 2 Hindi characters:

ल b'\u0932'
् b'\u094d'
--------
ल b'\u0932'
ी b'\u0940'
--------

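That is wrong: ल्ली is actually a single Hindi character. How can I get a whole Hindi character (such as ल्ली) and find out how many Unicode code points it is made of?

In short, I want to split 'कृपयाल्ली' into 'कृ', 'प', 'या', 'ल्ली'.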

Answer 1

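I'm not sure this is correct, since I'm Finnish and don't know Hindi well, but this merges each character with any Unicode mark characters (general category M*) that follow it: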

import unicodedata


def merge_compose(s: str):
    current = []
    for c in s:
        if current and not unicodedata.category(c).startswith("M"):
            yield current
            current = []
        current.append(c)
    if current:
        yield current


for group in merge_compose("कृपयाल्ली"):
    print(group, len(group), "->", "".join(group))

The output is

['क', 'ृ'] 2 -> कृ
['प'] 1 -> प
['य', 'ा'] 2 -> या
['ल', '्'] 2 -> ल्
['ल', 'ी'] 2 -> ली
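The last two groups show where this approach falls short of the split asked for in the question: ल्ली comes out as two pieces. A small sketch like the one below, which lists the general category of each code point, may help explain why. The helper name describe is only illustrative and not part of the code above.

```python
import unicodedata


def describe(s):
    """Return (char, code point, general category, Unicode name) for each code point in s."""
    return [
        (c, f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
        for c in s
    ]


rows = describe("ल्ली")
for row in rows:
    print(*row)
```

The second ल is a letter (category Lo), not a mark, so merge_compose starts a new group there even though the preceding virama (U+094D) joins the two consonants into a conjunct.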

Answer 2

I found the answer in another question.

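import unicodedata


def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)

    """
    virama = '\N{DEVANAGARI SIGN VIRAMA}'
    cluster = ''
    last = None
    for c in s:
        # First letter of the general category: 'M' = mark, 'L' = letter.
        cat = unicodedata.category(c)[0]
        # Marks always extend the cluster; a letter extends it only when it
        # directly follows a virama (i.e. it is part of a conjunct).
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster

print(list(splitclusters('कृपयाल्ली')))  # ['कृ', 'प', 'या', 'ल्ली']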
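To get at the original question (how many code points make up each character), one option is to pair every cluster with its length. The sketch below restates the same virama rule in a self-contained form; the name cluster_lengths is hypothetical, and like the function above it is only a simplification, not the full Unicode text segmentation algorithm.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"


def cluster_lengths(s):
    """Split s with the virama-aware rule and pair each cluster with its code point count."""
    clusters = []
    last = None
    for c in s:
        major = unicodedata.category(c)[0]
        # A mark always joins the current cluster; a letter joins only
        # when the previous code point was a virama.
        joins = major == "M" or (major == "L" and last == VIRAMA)
        if clusters and joins:
            clusters[-1] += c
        else:
            clusters.append(c)
        last = c
    return [(cluster, len(cluster)) for cluster in clusters]


result = cluster_lengths("कृपयाल्ली")
for cluster, n in result:
    print(cluster, n)
```

For the string from the question this should report ल्ली as a single cluster of 4 code points (ल, ्, ल, ी). Note that len() on a Python 3 str counts code points, not bytes.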

python

unicode

hindi