Which writing systems does a given character set cover?

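What is the simplest way to find out which writing systems (such as Latin, Hebrew, Arabic, Katakana, or Chinese characters) a given set of Unicode characters supports?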

Check the Script and Script_Extensions properties of each character in the set, as described in UAX #24.
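As a rough illustration of that check, here is a minimal sketch in Python. Python's standard `unicodedata` module does not expose the Script property, so the sketch parses lines in the format of the UCD file Scripts.txt. The embedded sample is only a short excerpt, and the helper names (`parse_ranges`, `script_of`, `tally_scripts`) are my own, not part of any library. Script_Extensions is not handled here; that data lives in a separate file (ScriptExtensions.txt) and would need the same kind of parsing.

```python
from collections import Counter

# Short excerpt in the format of the UCD file Scripts.txt.
# Real use would read the complete file instead of this sample.
SAMPLE_SCRIPTS_TXT = """\
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
05D0..05EA    ; Hebrew # Lo  [27] HEBREW LETTER ALEF..HEBREW LETTER TAV
30A1..30FA    ; Katakana # Lo  [90] KATAKANA LETTER SMALL A..KATAKANA LETTER VO
"""


def parse_ranges(text):
    """Return a list of (start, end, script) tuples from Scripts.txt-style text."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        code_field, script = (part.strip() for part in line.split(";"))
        first, _, last = code_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point or range
        ranges.append((start, end, script))
    return ranges


def script_of(ch, ranges):
    """Look up the Script value of one character (linear scan, fine for a sketch)."""
    cp = ord(ch)
    for start, end, script in ranges:
        if start <= cp <= end:
            return script
    # Code points not listed in Scripts.txt default to Unknown; with the short
    # sample above, this also catches everything the excerpt leaves out.
    return "Unknown"


def tally_scripts(chars, ranges):
    """Count how many characters of `chars` fall into each script."""
    return Counter(script_of(ch, ranges) for ch in chars)


RANGES = parse_ranges(SAMPLE_SCRIPTS_TXT)
print(tally_scripts("Az9\u05d0\u30ab", RANGES))
```

With the full data file, a sorted range list plus `bisect` would be a better fit than the linear scan used here.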

Scripts and Blocks:

Unicode characters are divided into non-overlapping ranges called blocks [Blocks]. Many of these blocks have a name derived from a script name, because characters of that script are primarily encoded in that block. However, blocks and scripts differ in the following ways:

  • Blocks are simply ranges, and often contain code points that are unassigned.
  • Characters from the same script may be encoded in several different blocks.
  • Characters from different scripts may be encoded in the same block.

As a result, using the block names as simplistic substitute for script identity generally leads to poor results. For example, see Annex A, Character Blocks, in Unicode Technical Standard #18, "Unicode Regular Expressions" [UTS18].

In the latter document [UTS18], pay particular attention to "Writing Systems Versus Blocks" in Annex A: Character Blocks.

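On this point, I would lean toward testing whether enough glyphs from a script appear in the character set.

This approach requires two preparatory steps:

  1. Compile a list of the writing systems (scripts) that Unicode supports.

  2. For each script, define a character set containing that script's characters.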

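I can then answer the question "does character set A support script X" by testing whether enough of the characters in script X's character set are also members of character set A. If I do this for every script from step (1), I get a list of supported scripts.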

The link provided by 一二三 references a data file that maps Unicode characters to their respective scripts, which is invaluable for steps (1) and (2).
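The two preparatory steps and the coverage test can be sketched roughly as follows. This is only an illustration under a few assumptions: the data file uses the Scripts.txt line format, the embedded sample stands in for the full file, the function names are hypothetical, and the 0.9 threshold for "enough" is an arbitrary value I picked, not something any standard defines.

```python
# Short excerpt in the format of Scripts.txt; real use would read the full file.
SAMPLE_SCRIPTS_TXT = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
05D0..05EA    ; Hebrew # Lo  [27] HEBREW LETTER ALEF..HEBREW LETTER TAV
30A1..30FA    ; Katakana # Lo  [90] KATAKANA LETTER SMALL A..KATAKANA LETTER VO
"""


def script_charsets(text):
    """Steps (1) and (2): map each script name to the set of its code points."""
    scripts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        code_field, script = (part.strip() for part in line.split(";"))
        first, _, last = code_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        scripts.setdefault(script, set()).update(range(start, end + 1))
    return scripts


def supported_scripts(charset, scripts, threshold=0.9):
    """Return the scripts whose characters are 'sufficiently' present in charset.

    `threshold` is the assumed fraction of a script's characters that must
    appear in `charset` for the script to count as supported.
    """
    supported = []
    for script, members in scripts.items():
        coverage = len(members & charset) / len(members)
        if coverage >= threshold:
            supported.append(script)
    return sorted(supported)


SCRIPT_SETS = script_charsets(SAMPLE_SCRIPTS_TXT)

# Character set A: all basic Latin letters plus only five Hebrew letters.
charset_a = set(range(0x41, 0x5B)) | set(range(0x61, 0x7B)) | set(range(0x5D0, 0x5D5))

print(supported_scripts(charset_a, SCRIPT_SETS))
```

Against the full data file a single global threshold is probably too blunt: large scripts such as Han contain many rare characters, so a per-script threshold or a curated "core" subset per script may work better.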