What markup languages are typically used for annotating information extraction corpora?
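I am building an information extraction corpus for extracting specific types of information, and I am trying to determine the best way to annotate entities. I found that the IEER corpus uses the SGML markup element ENAMEX for this (as described here: http://itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html). Since that document was written in 1997, I am guessing that this SGML-based approach is outdated and that there must be a better way to do this, for example using OWL, RDF or XML. Is there a more recent industry standard for annotating information extraction corpora?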
brat is the new classic for annotating language resources. It has its own standoff annotation standard. There is also the Anafora tool, which has its own XML-based standard. UIMA-based tools usually use the CAS format (though it is poorly documented). You should also look at the native GATE XML format.
If the information you want to encode is simple enough, such as named entity types, you can even go with a tabular format such as CoNLL.
If none of these meets your requirements, just implement whatever suits them.
The NLTK book (chapter 07, paragraph: Representing Chunks: Tags vs Trees) states:
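To make the standoff idea concrete: in brat's standoff format the annotations live in a separate .ann file and point into the raw text by character offsets, so the text itself stays untouched. Below is a minimal Python sketch of reading the text-bound (entity) lines of such a file. The sample sentence, the labels and the name parse_brat_entities are made up for illustration, and relations, events and discontinuous spans are simply skipped.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_entities(ann: str) -> List[Entity]:
    """Read text-bound annotations ("T" lines) from brat standoff content."""
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, notes etc. are ignored in this sketch
        ann_id, type_and_span, text = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        if ";" in span:
            continue  # discontinuous span, not handled here
        start, end = map(int, span.split())
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Hypothetical document and its standoff annotations.
doc = "Sony opened an office in Tokyo."
ann = "T1\tOrganization 0 4\tSony\nT2\tLocation 25 30\tTokyo\n"

entities = parse_brat_entities(ann)
for e in entities:
    # The offsets index into the untouched source text.
    print(e.label, repr(doc[e.start:e.end]))
```

The same offsets-into-text idea carries over to the other standoff formats mentioned above, only the serialization differs.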
The most widespread file representation uses IOB tags.
[...] each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). [...] The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP
We PRP B-NP
saw VBD O
the DT B-NP
little JJ I-NP
yellow JJ I-NP
dog NN I-NP
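For what it's worth, decoding this representation back into chunks takes only a few lines. The following is a rough Python sketch, not NLTK's own implementation; the function name iob_to_chunks is hypothetical, and the input is the example sentence from the book, one "word POS tag" triple per line.

```python
from typing import List, Tuple

SAMPLE = """\
We PRP B-NP
saw VBD O
the DT B-NP
little JJ I-NP
yellow JJ I-NP
dog NN I-NP
"""


def iob_to_chunks(rows: List[Tuple[str, str, str]]) -> List[Tuple[str, List[str]]]:
    """Group (word, pos, iob_tag) rows into (chunk_type, words) pairs."""
    chunks = []
    current_type = None
    current_words: List[str] = []
    for word, _pos, tag in rows:
        # A chunk starts at B-X, or at I-X when X differs from the open chunk.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            if current_words:
                chunks.append((current_type, current_words))
            if tag == "O":
                current_type, current_words = None, []
            else:
                current_type, current_words = tag[2:], [word]
        else:
            current_words.append(word)
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


rows = [tuple(line.split()) for line in SAMPLE.splitlines() if line.strip()]
chunks = iob_to_chunks(rows)
print(chunks)
```

For the sentence above this yields two NP chunks: "We" and "the little yellow dog".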
Wikipedia has a page on the IOB format.
Stanford NLP apparently supports it as well.
spaCy uses a slightly different format, BILUO.
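BILUO adds two tags to IOB: L marks the last token of a multi-token chunk and U marks a single-token chunk. Here is a minimal Python sketch of the conversion, written in plain Python rather than with spaCy's own helpers. The function name is made up for this example, and it assumes well-formed input in which every chunk opens with a B- tag.

```python
from typing import List


def iob_to_biluo(tags: List[str]) -> List[str]:
    """Convert IOB tags (every chunk starts with B-) to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The chunk continues only if the next token is inside the same type.
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


# Tags of "We saw the little yellow dog" from the NLTK example.
iob = ["B-NP", "O", "B-NP", "I-NP", "I-NP", "I-NP"]
print(iob_to_biluo(iob))
```

The single-token chunk "We" becomes U-NP, and "dog" closes its chunk with L-NP.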