Mapping entity IDs to strings in SpaCy 3.0

I have trained a simple NER pipeline using spaCy 3.0. After training, I want to get the list of predicted IOB tags, among other things, from the Doc (doc = nlp(text)). For example, ["O", "O", "B", "I", "O"].

I can easily get the IOB IDs (integers) using

>>> doc.to_array("ENT_IOB")
array([2, 2, ..., 2], dtype=uint64)

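But how can I get the mappings/lookup?

I did not find any lookup table for this in doc.vocab.lookups.tables.

I also understand that I can get the same result by accessing ent_iob_ on each token ([token.ent_iob_ for token in doc]), but I was wondering whether there is a better way.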

Check the Token documentation:

  • ent_iob IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set.
  • ent_iob_ IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.

So all you need is to map the IDs to the names with a simple dictionary lookup, iob_map = {0: "", 1: "I", 2: "O", 3: "B"}:

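import spacy

# Assumption: any pipeline with an NER component works here; the outputs
# shown below depend on the model that is loaded.
nlp = spacy.load("en_core_web_sm")
doc = nlp("John went to New York in 2010.")
print([x.text for x in doc.ents])
# => ['John', 'New York', '2010']
# Integer IOB codes -> string tags, per the Token documentation above
iob_map = {0: "", 1: "I", 2: "O", 3: "B"}
print(list(map(iob_map.get, doc.to_array("ENT_IOB").tolist())))
# => ['B', 'O', 'O', 'B', 'I', 'O', 'B', 'O']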
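
To try the mapping without a trained pipeline, here is a minimal sketch that wraps the same dictionary lookup in a function and runs it on a plain list of integers, such as the one returned by doc.to_array("ENT_IOB").tolist(). The name iob_ids_to_tags is hypothetical and not part of spaCy. The integer codes are taken from the Token documentation quoted above, and the input list is typed in by hand to match the example sentence.

```python
from typing import Iterable, List

# IOB codes as described in the spaCy Token documentation quoted above.
IOB_MAP = {0: "", 1: "I", 2: "O", 3: "B"}


def iob_ids_to_tags(iob_ids: Iterable[int]) -> List[str]:
    """Map integer IOB codes to their string tags.

    Hypothetical helper, not a spaCy API. int() is applied so that
    plain ints and NumPy integer scalars are handled the same way.
    """
    return [IOB_MAP[int(i)] for i in iob_ids]


# Hand-written codes for "John went to New York in 2010." (8 tokens).
ids = [3, 2, 2, 3, 1, 2, 3, 2]
tags = iob_ids_to_tags(ids)
print(tags)
# => ['B', 'O', 'O', 'B', 'I', 'O', 'B', 'O']
```

Unlike iob_map.get, indexing the dictionary raises a KeyError on an unexpected code instead of silently returning None, which may be preferable when validating model output.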