Swiftly generate and sort a full encoding dictionary and the corresponding primary radicals

Question

Chinese characters, according to the Unihan encoding schema, can be indexed by their radicals.

The Stanford Word Segmenter has a command that can do this, as described in its documentation, i.e.:

java -cp stanford-segmenter-VERSION.jar \
    edu.stanford.nlp.trees.international.pennchinese.RadicalMap \
    -infile whitespace_seperated_chinese_characters.input \
    > each_character_denoted_by_radical.output

I want to create a comprehensive table of Chinese characters, organized by radical. I think I could use the functions

public static java.util.Set getChars(char ch)

(which characters have this radical?)

or

public static char getRadical(char ch)

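(what is the radical of this character?)

But my question is: what is the most efficient way to achieve this, and then to output the result in the form of a table, à la this Wikipedia table (not exactly a table but, shall we say, table-like)?

The Stanford tool uses the CC-CEDICT dictionary. Can I download that dictionary and feed it in? If so, how?

Perhaps the Stanford tool already includes it as part of the code, but then how do I access it?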

Answer 1

This information is already encoded in the form you want in the RadicalMap source code.

Look at the static initializer:

String[] radLists = {"\u4e00\u4e00\u4e01\u4e02\u4e03...", "...", ..., };

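The first character of each string in this list is a radical, and the remaining characters in that string are the characters whose primary radical is that first character.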
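As a rough illustration of that layout, here is a minimal sketch that turns an array shaped like radLists into a sorted radical-to-characters map. The class name RadicalListParser and the method parse are hypothetical and not part of the Stanford code. The sample data is only the visible prefix of the excerpt above plus one extra entry (radical U+4E28 with U+4E2D) added for illustration; the real array would have to be copied from the RadicalMap source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Stanford library.
final class RadicalListParser {

    // Builds a radical -> characters map from strings laid out like radLists:
    // index 0 is the radical, every later char is a character filed under it.
    static Map<Character, List<Character>> parse(String[] radLists) {
        // TreeMap keeps the radicals sorted by their UTF-16 code unit value.
        Map<Character, List<Character>> table = new TreeMap<>();
        for (String entry : radLists) {
            if (entry == null || entry.isEmpty()) {
                continue; // nothing to index in an empty entry
            }
            char radical = entry.charAt(0);
            List<Character> members =
                    table.computeIfAbsent(radical, k -> new ArrayList<>());
            for (int i = 1; i < entry.length(); i++) {
                members.add(entry.charAt(i));
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Sample data only (assumption): a truncated first entry and one
        // illustrative second entry, not the full list from the source.
        String[] sample = {"\u4e00\u4e00\u4e01\u4e02\u4e03", "\u4e28\u4e2d"};
        Map<Character, List<Character>> table = parse(sample);
        System.out.println(table.size() + " radicals, "
                + table.get('\u4e00').size() + " characters under U+4E00");
    }
}
```

Since every string is scanned exactly once, building the whole table this way is linear in the total number of characters, which should be cheaper than calling getRadical or getChars for every candidate character.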

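radLists is a package-private static variable, so there is no entirely clean way to access it programmatically. But you can easily copy its definition out of the source code and use it for whatever you need.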
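For the table-like output asked about in the question, one possible sketch is below. It assumes you have already built a sorted radical-to-characters map from the copied definition; the class RadicalTablePrinter, its render method, and the row format (code point, radical, tab, then the member characters) are all hypothetical choices, not something the Stanford tool provides.

```java
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Stanford library.
final class RadicalTablePrinter {

    // Renders one row per radical: "U+XXXX <radical><TAB><char> <char> ...".
    // Rows come out in the iteration order of the map that is passed in.
    static String render(Map<Character, List<Character>> table) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Character, List<Character>> row : table.entrySet()) {
            char radical = row.getKey();
            out.append(String.format("U+%04X %c\t", (int) radical, radical));
            List<Character> members = row.getValue();
            for (int i = 0; i < members.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(members.get(i));
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative sample rows only (assumption), not the full data set.
        Map<Character, List<Character>> table = new TreeMap<>();
        table.put('\u4e00', List.of('\u4e00', '\u4e01', '\u4e02', '\u4e03'));
        table.put('\u4e28', List.of('\u4e2d'));
        // Write as UTF-8 explicitly so the characters survive on any platform.
        PrintWriter stdout = new PrintWriter(
                new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
        stdout.print(render(table));
        stdout.flush();
    }
}
```

A tab-separated row per radical is easy to redirect into a file and open in a spreadsheet, which gets close to the Wikipedia-style layout without any extra tooling.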


command-line-interface

character-encoding

stanford-nlp