Which writing systems does a given character set cover?
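What is the simplest way to find out which writing systems (such as Latin, Hebrew, Arabic, Katakana, or Chinese characters) a given set of Unicode characters supports?
Check the Script and Script_Extensions properties of each character in the set, as described in UAX #24.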
Unicode characters are divided into non-overlapping ranges called
blocks [Blocks]. Many of these blocks have a name derived from
a script name, because characters of that script are primarily encoded
in that block. However, blocks and scripts differ in the following
ways:
- Blocks are simply ranges, and often contain code points that are unassigned.
- Characters from the same script may be encoded in several different blocks.
- Characters from different scripts may be encoded in the same block.
As a result, using the block names as simplistic substitute for script
identity generally leads to poor results. For example, see Annex A,
Character Blocks, in Unicode Technical Standard #18, "Unicode Regular
Expressions" [UTS18].
In the latter document [UTS18], pay particular attention to "Writing Systems Versus Blocks" in Annex A: Character Blocks.
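The per-character check can be sketched in Python roughly as follows. Python's standard `unicodedata` module does not expose the Script property, so this sketch assumes a small hand-copied excerpt of script ranges; the names `SCRIPT_RANGES`, `script_of`, and `scripts_in` are hypothetical, and a real table would be built from the full Unicode data. Script_Extensions is not handled here.

```python
from bisect import bisect_right

# Hand-copied excerpt of Script property ranges (first, last, script).
# This is an assumption for illustration, not the full Unicode data.
SCRIPT_RANGES = [
    (0x0030, 0x0039, "Common"),  # ASCII digits
    (0x0041, 0x005A, "Latin"),   # A..Z
    (0x0061, 0x007A, "Latin"),   # a..z
    (0x05D0, 0x05EA, "Hebrew"),  # alef..tav
]
STARTS = [first for first, _, _ in SCRIPT_RANGES]


def script_of(ch):
    """Return the script of ch, or None if the excerpt does not cover it."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _, last, script = SCRIPT_RANGES[i]
        if cp <= last:
            return script
    return None


def scripts_in(chars):
    """Collect the scripts of all characters that the excerpt covers."""
    return {s for s in map(script_of, chars) if s is not None}
```

Calling `scripts_in` on a mixed string of Latin and Hebrew letters would then report both scripts, regardless of which blocks the characters happen to live in.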
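At this point, I lean toward testing whether enough glyphs from a script appear in the character set.
This approach needs two preparatory steps:
1. Compile a list of the writing systems (scripts) that Unicode supports.
2. For each script, define a character set containing the characters of that script.
Then I can answer the question "does character set A support script X?" by testing whether enough of the characters in script X's character set are also members of character set A. If I do this for every script from step (1), I get a list of supported scripts.
The link provided by 一二三 references a data file that maps Unicode characters to their respective scripts, which is invaluable for steps (1) and (2).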
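A minimal sketch of this coverage test, under some assumptions: the data is taken to be in the `Scripts.txt` line format (`range ; Script # comment`), `SAMPLE` is only a short excerpt in that format rather than the real file, and the 0.9 threshold for "enough" is an arbitrary illustrative choice. The function names `parse_scripts` and `supported_scripts` are hypothetical.

```python
from collections import defaultdict

# Abbreviated sample in the Scripts.txt line format (not the full file).
SAMPLE = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
05D0..05EA    ; Hebrew # Lo  [27] HEBREW LETTER ALEF..HEBREW LETTER TAV
"""


def parse_scripts(text):
    """Steps (1) and (2): map each script name to its set of code points."""
    table = defaultdict(set)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        rng, script = (part.strip() for part in line.split(";"))
        first, _, last = rng.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point or range
        table[script].update(range(lo, hi + 1))
    return table


def supported_scripts(charset, table, threshold=0.9):
    """Return {script: coverage} for scripts covered at least `threshold`."""
    cps = {ord(c) for c in charset}
    result = {}
    for script, members in table.items():
        ratio = len(members & cps) / len(members)
        if ratio >= threshold:
            result[script] = ratio
    return result
```

With the sample above, a character set holding only the 52 ASCII letters covers 52 of the 53 sampled Latin code points, so it passes a 0.9 threshold but not a strict 1.0 threshold. Which threshold counts as "enough" depends on the use case.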