AllenNLP BERT SRL input format ("OntoNotes v. 5.0 formatted")

Question

The goal is to train BERT SRL on another data set. According to the configuration, it requires conll-formatted-ontonotes-5.0.

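Originally, my data comes in a CoNLL format, and I converted it to the conll-formatted-ontonotes-5.0 format of the GitHub edition of OntoNotes v.5.0. Reading the data works and training appears to run, except that precision stays at 0. I suspect that either the encoding of the SRL arguments (BIO or phrasal?) or the column structure (other CoNLL-format editions of OntoNotes differ here) differs from the expected input. Alternatively, errors could arise if the role labels are hard-wired in the code. Following the reference data, I use the long form (ARGM-TMP), but other data often uses the short form (AM-TMP).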

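The question is which data set and format are expected here. My guess is that it is one of the CoNLL/Skel formats of OntoNotes 5.0 with the WORD column restored, but:

The CoNLL edition does not seem to ship with the LDC edition of OntoNotes.
It does not seem to be the format of the "conll-formatted-ontonotes-5.0" edition of OntoNotes v.5.0 that the OntoNotes creators provide on GitHub.
There is at least one other CoNLL/Skel edition of the OntoNotes 5.0 data, distributed as part of PropBank. It differs from the other one in omitting 3 columns and in the encoding of predicates. (For part of my data, this is the native format.)
The SrlReader documentation mentions BIO (IOBES) encoding. This has indeed been used in other CoNLL editions of PropBank data, but not in the OntoNotes corpora mentioned above. Other such formats include the CoNLL-2008 and CoNLL-2009 formats and their variants.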

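Before I start reverse-engineering SrlReader, does anyone have a data snippet at hand so that I can prepare my data accordingly?

The conll-formatted-ontonotes-5.0 version of my data (sample from the EWT corpus):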

google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   0   where   WRB (TOP(S(SBARQ(WHADVP*)   -   -   -   -   *   (ARGM-LOC*) *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   1   can MD  (SQ*    -   -   -   -   *   (ARGM-MOD*) *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   2   I   PRP (NP*)   -   -   -   -   *   (ARG0*) *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   3   get VB  (VP*    get 01  -   -   *   (V*)    *   *   -
google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0   4   morcillas   NNS (NP*)   -   -   -   -   *   (ARG1*) *   *   -
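
To make the "BIO or phrasal" distinction concrete: the predicate-argument columns in the sample above use bracketed (phrasal) span notation. A minimal sketch of how one such column could be mapped to BIO tags, assuming flat, non-nested spans; `spans_to_bio` is a hypothetical helper written for illustration, not a function from AllenNLP:

```python
from typing import List, Optional


def spans_to_bio(column: List[str]) -> List[str]:
    """Map one bracketed argument column, e.g. ["(ARG0*)", "(V*)", "*"], to BIO tags.

    Assumption: spans are flat (no nesting inside a single column).
    """
    tags: List[str] = []
    current: Optional[str] = None  # label of the span that is currently open
    for cell in column:
        if cell.startswith("("):
            # A new span opens on this token, e.g. "(ARG1*" or "(ARG1*)".
            current = cell.strip("()*")
            tags.append("B-" + current)
        elif current is not None:
            # Token inside a span that was opened earlier, e.g. "*" or "*)".
            tags.append("I-" + current)
        else:
            tags.append("O")
        if cell.endswith(")"):
            # The span closes on this token.
            current = None
    return tags


# First argument column of the five sample rows above:
print(spans_to_bio(["(ARGM-LOC*)", "(ARGM-MOD*)", "(ARG0*)", "(V*)", "(ARG1*)"]))
# ['B-ARGM-LOC', 'B-ARGM-MOD', 'B-ARG0', 'B-V', 'B-ARG1']
```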

Answer 1

The "native" format is the one used in the CoNLL-2012 edition; see cemantix.org/conll/2012/data.html for how to create it.
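
For orientation, a CoNLL-2012 row has eleven fixed columns, followed by one predicate-argument column per predicate in the sentence and a final coreference column. The sketch below splits a row accordingly; `split_conll2012_row` and the field names are hypothetical and chosen for illustration, they are not AllenNLP's API:

```python
from typing import Dict, List, Union

# The eleven fixed columns of a CoNLL-2012 row, in order.
FIXED_COLUMNS = [
    "document_id", "part_number", "word_number", "word", "pos_tag",
    "parse_bit", "predicate_lemma", "predicate_frameset_id",
    "word_sense", "speaker", "named_entities",
]


def split_conll2012_row(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one whitespace-separated CoNLL-2012 row into named fields."""
    fields = line.split()
    # At least the fixed columns plus the coreference column must be present.
    if len(fields) < len(FIXED_COLUMNS) + 1:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    row: Dict[str, Union[str, List[str]]] = dict(zip(FIXED_COLUMNS, fields))
    # Everything between the fixed columns and the last column is SRL data,
    # one column per predicate in the sentence.
    row["predicate_arguments"] = fields[len(FIXED_COLUMNS):-1]
    row["coreference"] = fields[-1]
    return row


row = split_conll2012_row(
    "google/ewt/answers/00/20070404104007AAY1Chs_ans.xml 0 3 get VB (VP* "
    "get 01 - - * (V*) * * -"
)
print(row["predicate_lemma"], row["predicate_arguments"])
# get ['(V*)', '*', '*']
```

A quick sanity check on converted data is that every row of a sentence yields the same number of predicate-argument columns, and that this number matches the number of rows with a predicate frameset ID.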

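The Ontonotes class that reads it may, however, run into difficulties when parsing "native" CoNLL-2012 data, because the CoNLL-2012 preprocessing scripts can produce invalid parse trees. Parsing with NLTK then raises a ValueError such as: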

ValueError: Tree.read(): expected ')' but got 'end-of-string'
            at index 1427.
                "...LT#.#.) ))"

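There is no direct way to fix this at the data level, because the string being parsed is an intermediate representation, not the original data. To process CoNLL-2012 data, the ValueError has to be caught, cf. https://github.com/allenai/allennlp/issues/5410.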
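
A minimal sketch of such a guard, assuming it is acceptable to continue without a parse tree for the affected sentence. `parse_or_none` is a hypothetical wrapper, not AllenNLP code, and not necessarily how the linked issue was resolved:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_or_none(parse_string: str, parser: Callable[[str], T]) -> Optional[T]:
    """Run `parser` on `parse_string`; return None if it raises ValueError.

    Intended for bracketed parse strings assembled from the parse-bit column,
    which may be unbalanced in preprocessed CoNLL-2012 data.
    """
    try:
        return parser(parse_string)
    except ValueError:
        # Malformed tree: skip the parse, keep the other annotations.
        return None


# With NLTK installed, the parser would be Tree.fromstring:
#   from nltk import Tree
#   tree = parse_or_none(parse_string, Tree.fromstring)
```

Callers then have to treat a missing tree as a valid case; the SRL columns do not depend on it.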

srl

conll

allennlp