I want to know how many Unicode code points make up one Hindi character.
I tried this in Python:
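    import regex  # third-party module (pip install regex), not the stdlib re

    # \X matches one extended grapheme cluster
    for m in regex.findall(r"\X", 'ल्लील्ली', regex.UNICODE):
        for i in m:
            print(i, i.encode('unicode-escape'))
        print('--------')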
The result shows ल्ली split into 2 Hindi characters:
ल b'\u0932'
् b'\u094d'
--------
ल b'\u0932'
ी b'\u0940'
--------
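That is wrong: ल्ली is actually a single Hindi character. How can I get each Hindi character (such as ल्ली), together with the number of Unicode code points it is composed of?
In short, I want to split 'कृपयाल्ली' into 'कृ', 'प', 'या', 'ल्ली'.

I'm not sure this is correct, since I'm Finnish and not fluent in Hindi, but this merges each character with any Unicode mark characters that follow it: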
    import unicodedata

    def merge_compose(s: str):
        current = []
        for c in s:
            # A non-mark character (category not starting with "M") starts a new group
            if current and not unicodedata.category(c).startswith("M"):
                yield current
                current = []
            current.append(c)
        if current:
            yield current

    for group in merge_compose("कृपयाल्ली"):
        print(group, len(group), "->", "".join(group))
The output is:
['क', 'ृ'] 2 -> कृ
['प'] 1 -> प
['य', 'ा'] 2 -> या
['ल', '्'] 2 -> ल्
['ल', 'ी'] 2 -> ली
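
To see why the last two groups come out separate instead of as ल्ली, it can help to print the general category of each code point. The following is a small sketch using only the standard unicodedata module; the variable name rows is just for illustration. The virama is a mark (Mn), so it attaches to the first ल, but the second ल is a letter (Lo) and therefore starts a new group under the "merge marks only" rule.

```python
import unicodedata

# One row per code point in the conjunct: (code point, category, name)
rows = [
    (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
    for c in "\u0932\u094d\u0932\u0940"  # ल + virama + ल + vowel sign II
]

for code_point, category, name in rows:
    print(code_point, category, name)
```

Any rule that is meant to keep the conjunct together has to treat a letter that follows a virama as part of the current group.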

I found the answer in another question.
    import unicodedata

    def splitclusters(s):
        """Generate the grapheme clusters for the string s. (Not the full
        Unicode text segmentation algorithm, but probably good enough for
        Devanagari.)
        """
        virama = '\N{DEVANAGARI SIGN VIRAMA}'
        cluster = ''
        last = None
        for c in s:
            cat = unicodedata.category(c)[0]
            # Extend the cluster on any mark, or on a letter right after a virama
            if cat == 'M' or (cat == 'L' and last == virama):
                cluster += c
            else:
                if cluster:
                    yield cluster
                cluster = c
            last = c
        if cluster:
            yield cluster

    # print(list(splitclusters(word)))
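
As a usage sketch for the original question (how many code points make up each character), the same virama rule can return each cluster together with its length. The helper name cluster_lengths is hypothetical and not from the linked answer, and like the function above it only approximates the full Unicode segmentation rules.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"

def cluster_lengths(s):
    """Return (cluster, number of code points) pairs.

    Hypothetical helper: a mark always joins the current cluster, and a
    letter joins it only when it directly follows a virama.
    """
    clusters = []
    last = None
    for c in s:
        major = unicodedata.category(c)[0]
        joins = major == "M" or (major == "L" and last == VIRAMA)
        if clusters and joins:
            clusters[-1] += c
        else:
            clusters.append(c)
        last = c
    return [(cluster, len(cluster)) for cluster in clusters]

for cluster, count in cluster_lengths("कृपयाल्ली"):
    print(cluster, count)
```

For 'कृपयाल्ली' this yields four clusters of 2, 1, 2 and 4 code points, so ल्ली is counted as one character made of four code points.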