Parsing CoNLL-U missing annotation (misc)

I am trying to parse the .conll files from this GitHub repo. Here is a sample of my parsing code:

from io import open
from conllu import parse_incr  # parse_incr is the function actually called below
import glob

for filename in glob.glob('./licenses-conll-format/22-MIT/MIT_permissionCopy.conll'):
    # Open the file in a with-block so it is closed after parsing
    with open(filename, "r", encoding="utf-8") as data_file:
        for tokentree in parse_incr(data_file):
            print(tokentree.serialize())

Output:

24  Permission  _   NN  NN  _   27  nsubjpass   _   _
25  is  _   VBZ VBZ _   27  auxpass _   _
26  hereby  _   RB  RB  _   27  advmod  _   _
27  granted _   VBN VBN _   11  rcmod   _   _
28  ,   _   ,   ,   _   27  punct   _   _
29  free    _   JJ  JJ  _   27  advmod  _   _
30  of  _   IN  IN  _   0   erased  _   _
31  charge  _   NN  NN  _   29  prep_of _   _

This seems to be missing some of the annotations from the original .conll file (I-PERMISSION, B-PERMISSION, etc.):

24  Permission  _   NN  NN  _   27  nsubjpass   _   _   B-PERMISSION    COPY
25  is  _   VBZ VBZ _   27  auxpass _   _   I-PERMISSION
26  hereby  _   RB  RB  _   27  advmod  _   _   I-PERMISSION
27  granted _   VBN VBN _   11  rcmod   _   _   I-PERMISSION
28  ,   _   ,   ,   _   27  punct   _   _   O
29  free    _   JJ  JJ  _   27  advmod  _   _   I-PERMISSION
30  of  _   IN  IN  _   0   erased  _   _   I-PERMISSION
31  charge  _   NN  NN  _   29  prep_of _   _   I-PERMISSION
32  ,   _   ,   ,   _   27  punct   _   _   O

Any ideas on how to get all of the annotations?

You can specify the tuple of fields yourself:

fields = ('id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc', 'rest')
for tokentree in parse_incr(data_file, fields=fields):
    print(tokentree.serialize())

Output:

24  Permission  _   NN  NN  _   27  nsubjpass   _   _   B-PERMISSION
25  is  _   VBZ VBZ _   27  auxpass _   _   I-PERMISSION
26  hereby  _   RB  RB  _   27  advmod  _   _   I-PERMISSION
27  granted _   VBN VBN _   11  rcmod   _   _   I-PERMISSION
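
Note that the second extra column on token 24 (COPY) still does not appear in this output, presumably because the tuple names only one column beyond the standard ten. If the number of trailing columns varies from line to line, one option is to split the lines by hand. Below is a minimal sketch of that idea, assuming the file is tab-separated; `split_token_line` and the `"extra"` key are made-up names for illustration and are not part of conllu:

```python
# The ten standard CoNLL-U column names; anything after them goes into "extra".
CONLLU_FIELDS = ("id", "form", "lemma", "upostag", "xpostag",
                 "feats", "head", "deprel", "deps", "misc")


def split_token_line(line):
    """Split one tab-separated token line into standard fields plus extras."""
    columns = line.rstrip("\n").split("\t")
    token = dict(zip(CONLLU_FIELDS, columns))
    # Hypothetical key: collect every trailing column, however many there are.
    token["extra"] = columns[len(CONLLU_FIELDS):]
    return token


# Two lines from the question, written out with explicit tabs (an assumption
# about the original file's separator).
sample = (
    "24\tPermission\t_\tNN\tNN\t_\t27\tnsubjpass\t_\t_\tB-PERMISSION\tCOPY\n"
    "25\tis\t_\tVBZ\tVBZ\t_\t27\tauxpass\t_\t_\tI-PERMISSION\n"
)

# Skip blank lines, which separate sentences in CoNLL-style files.
tokens = [split_token_line(line) for line in sample.splitlines() if line.strip()]
for token in tokens:
    print(token["form"], token["extra"])
```

This keeps both B-PERMISSION and COPY for token 24, but it gives up the sentence grouping and serialization that conllu provides, so it is only worth considering when the trailing columns are all you need.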