How to use a custom named entities dataset in spaCy's DependencyMatcher?

Question

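Suppose I have created a spaCy model or dataset containing all the named entities from a particular text, tagged as PERSON. How can I apply it in the DependencyMatcher if I need to extract "person"-"root verb" pairs? In other words, I want the DependencyMatcher to identify persons using the dataset of names I have already made, rather than its own model for recognising them.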

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_lg")

def on_match(matcher, doc, i, matches):
    # Callback invoked once per match; i is the index of the current match
    return matches

patterns = [
    [  # pattern 1: (sur)name + verb, e.g. "Jack lived"
        {
            # Anchor token: a PERSON entity that is a nominal subject
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"}
        },
        {
            # "person < verb": the person is an immediate dependent of the verb
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"}
        }
    ]
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", patterns, on_match=on_match)

Answer 1

The DependencyMatcher has no "custom model for recognising persons"; that is the NER component of the pipeline you loaded. In this case you should:

1. Disable the NER component
2. Use an EntityRuler to tag the names
3. Use the DependencyMatcher as usual
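The three steps above can be sketched end to end roughly as follows. To keep the sketch runnable without downloading a model, it makes two assumptions that are not part of the original answer: a blank English pipeline stands in for `spacy.load("en_core_web_lg", disable=["ner"])`, and the dependency parse of a sample sentence is written by hand instead of coming from the parser. The name "Jack" and the sentence are illustrative only.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Step 1 (assumption): a blank pipeline has no NER component, so it stands in
# for spacy.load("en_core_web_lg", disable=["ner"]).
nlp = spacy.blank("en")

# Step 2: tag names from your own list with an EntityRuler.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Jack"}])

# Assumption: a hand-written parse replaces the output of the real parser.
words = ["Jack", "lived", "in", "London", "."]
heads = [1, 1, 1, 2, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "prep", "pobj", "punct"]
pos = ["PROPN", "VERB", "ADP", "PROPN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)
doc = ruler(doc)  # sets doc.ents from the name list

# Step 3: use the DependencyMatcher as usual.
pattern = [
    {"RIGHT_ID": "person",
     "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"}},
    {"LEFT_ID": "person", "REL_OP": "<", "RIGHT_ID": "verb",
     "RIGHT_ATTRS": {"POS": "VERB"}},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

# Token ids in each match follow the order of the nodes in the pattern.
pairs = [(doc[p].text, doc[v].text) for _, (p, v) in matcher(doc)]
print(pairs)
```

With a real pipeline you would call `nlp(text)` instead of building the `Doc` by hand; the pattern and the matcher setup stay the same.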

To disable a component, you can do this:

nlp = spacy.load("en_core_web_lg", disable=["ner"])

To match the names in your list with the EntityRuler, see the rule-based matching docs.
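As a minimal sketch of what that could look like, assuming the names are available as a plain Python list (the `names` list and the sample sentence here are hypothetical), each name becomes one phrase pattern labelled PERSON. A blank pipeline is used so the sketch runs without a model download; with `en_core_web_lg` loaded with `disable=["ner"]`, the `add_pipe` and `add_patterns` calls would be the same.

```python
import spacy

# Hypothetical list of names taken from your own dataset.
names = ["Jack", "Mary Smith"]

nlp = spacy.blank("en")  # assumption: stand-in for the real pipeline
ruler = nlp.add_pipe("entity_ruler")

# String patterns are matched as exact phrases, so multi-token names work too.
ruler.add_patterns([{"label": "PERSON", "pattern": name} for name in names])

doc = nlp("Jack met Mary Smith in London.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```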

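Note that the above assumes you have a list of names, rather than annotations of what is a name in each sentence. If you have explicitly annotated names, you can skip step 2: just disabling the NER component is enough to keep only the existing annotations.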
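For the annotated case, one possible sketch is shown below. It assumes the annotations are stored as character offsets in the form `(start_char, end_char, label)`; that format, the `annotations` variable, and the sample text are assumptions, not something stated in the question. The entities are set on the `Doc` first, and the remaining components are then run on it; since no NER component is enabled, nothing replaces them.

```python
import spacy

# Assumption: a blank pipeline stands in for
# spacy.load("en_core_web_lg", disable=["ner"]).
nlp = spacy.blank("en")

text = "Jack lived in London."
# Hypothetical annotation format: (start_char, end_char, label)
annotations = [(0, 4, "PERSON")]

doc = nlp.make_doc(text)  # tokenize only
spans = [doc.char_span(start, end, label=label)
         for start, end, label in annotations]
# char_span returns None when offsets do not align with token boundaries.
doc.set_ents([span for span in spans if span is not None])

# Run the enabled components (tagger, parser, ...) on the pre-annotated Doc.
for _, component in nlp.pipeline:
    doc = component(doc)

annotated = [(ent.text, ent.label_) for ent in doc.ents]
print(annotated)
```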

python

dependencies

named-entity-recognition

spacy

spacy-3