spaCy rules to annotate words based on the previous label

Question

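I am building an NER system with spaCy and would like to define some rules. I want the system to determine the gender of a name when the name is preceded by a word such as Mr. or Mrs. In that case, the name should receive the same annotation as the word before it.

For example, given the sentence "Mr. Johnson goes to Los Angeles", my tagger already classifies the word "Mr." as Male, but the word "Johnson" is tagged as unknown rather than Male. I would like the tagger to look for structures like this and annotate the second item with the same label as the first. Is this possible at all?

The following code does not work for me at the moment.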

for ent in doc.ents:
    if ent.label_ == "Male":
        next_token = doc[ent.end]
        if next_token.text[0].isupper():
            rulerAll.add_patterns([{"label": "Male", "pattern": next_token.text}])
nlp.add_pipe(rulerAll, before="ner")

Answer 1

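Using a custom tagger

From what I understand of your question, if an entity is tagged as Male (the token Mr. in your case), then the token that follows it (Johnson in your case) should be treated as part of that Male entity.

Since your tagger can already detect Mr. as Male, I assume you have a tagger built from scratch and are not using spaCy's tagger.

In that case, it can be done as follows:

Code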

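import spacy
from spacy.pipeline import EntityRuler

# spaCy v2 API. Assumes an earlier component already labels "Mr." as Male
# and "Mrs." as Female (en_core_web_sm does not do this on its own).
nlp = spacy.load("en_core_web_sm")

# overwrite_ents=True lets the longer title + name span replace the title entity
ruler = EntityRuler(nlp, overwrite_ents=True)

# A Male/Female entity token followed by one word token becomes a single name entity
patterns = [{"label": "MALE_NAME", "pattern": [{"ENT_TYPE": "Male"}, {"TEXT": {"REGEX": r"\w+"}}]},
            {"label": "FEMALE_NAME", "pattern": [{"ENT_TYPE": "Female"}, {"TEXT": {"REGEX": r"\w+"}}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)  # added last, so it runs after the component that sets Male/Female

doc = nlp("Mr. Johnson goes to Los Angeles, and Mrs. Smith went to San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ['MALE_NAME', 'FEMALE_NAME']])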

Output

[('Mr. Johnson', 'MALE_NAME'), ('Mrs. Smith', 'FEMALE_NAME')]
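
The snippet above is written against the spaCy v2 API (the ruler object is built first and then passed to nlp.add_pipe). As a minimal sketch of the same idea under spaCy v3, assuming v3 is installed, the ruler is created by name through nlp.add_pipe("entity_ruler", ...). The custom tagger is not shown in the question, so a first ruler called title_ruler (a hypothetical stand-in) labels the titles as Male and Female. A second ruler, name_ruler, then extends each title to the capitalized token that follows it. The IS_TITLE check is a swapped-in choice here and not the original \w+ pattern.

```python
import spacy

# Blank English pipeline: tokenizer only, no pretrained model required.
nlp = spacy.blank("en")

# Hypothetical stand-in for the asker's custom tagger: label the titles themselves.
title_ruler = nlp.add_pipe("entity_ruler", name="title_ruler")
title_ruler.add_patterns([
    {"label": "Male", "pattern": [{"LOWER": {"IN": ["mr", "mr."]}}]},
    {"label": "Female", "pattern": [{"LOWER": {"IN": ["mrs", "mrs."]}}]},
])

# The second ruler runs afterwards and may overwrite the shorter title entities.
name_ruler = nlp.add_pipe(
    "entity_ruler", name="name_ruler", config={"overwrite_ents": True}
)
name_ruler.add_patterns([
    {"label": "MALE_NAME", "pattern": [{"ENT_TYPE": "Male"}, {"IS_TITLE": True}]},
    {"label": "FEMALE_NAME", "pattern": [{"ENT_TYPE": "Female"}, {"IS_TITLE": True}]},
])

doc = nlp("Mr. Johnson goes to Los Angeles, and Mrs. Smith went to San Francisco.")
name_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(name_ents)
```

With a real custom tagger in the pipeline, title_ruler would be dropped and name_ruler added after whichever component assigns the Male and Female labels.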

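Using spaCy's built-in tagger (for completeness)

If you use spaCy's nlp pipeline with a spaCy model, you can use either the EntityRuler or the Matcher. Both patterns below rely on the model's NER having already tagged the surname as PERSON.

Extracting male and female names with their prefix

EntityRuler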

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp, overwrite_ents=True)
patterns = [{"label": "MALE_NAME", "pattern": [{"LOWER": {"IN": ["mr", "mr."]}}, {"ENT_TYPE": "PERSON"}]},
            {"label": "FEMALE_NAME", "pattern": [{"LOWER": {"IN": ["mrs", "mrs."]}}, {"ENT_TYPE": "PERSON"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp("Mr. Johnson goes to Los Angeles, and Mrs. Smith went to San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ['MALE_NAME', 'FEMALE_NAME']])

Output

[('Mr. Johnson', 'MALE_NAME'), ('Mrs. Smith', 'FEMALE_NAME')]

Matcher

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Create patterns
male_name_pattern = [{"LOWER": {"IN": ["mr", "mr."]}}, {"ENT_TYPE": "PERSON"}]
female_name_pattern = [{"LOWER": {"IN": ["mrs", "mrs."]}}, {"ENT_TYPE": "PERSON"}]

# Add patterns
matcher.add("MALE_NAME", None, male_name_pattern)
matcher.add("FEMALE_NAME", None, female_name_pattern)

doc = nlp("Mr. Johnson goes to Los Angeles, and Mrs. Smith went to San Francisco.")
matches = matcher(doc)
for match_id, start, end in matches:
    # Get string representation of pattern name
    string_id = nlp.vocab.strings[match_id] 
    # The matched span
    span = doc[start:end]  
    print(span.text, string_id)

Output

Mr. Johnson MALE_NAME
Mrs. Smith FEMALE_NAME
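
Two caveats apply to the Matcher version. The matcher.add(name, None, pattern) call is the spaCy v2 signature; in v3 the patterns are passed as a list and the None placeholder is dropped. The Matcher also only reports matches and leaves doc.ents unchanged. The following is a minimal sketch, assuming spaCy v3, of writing the matches back as entities. To avoid depending on a downloaded model, the PERSON labels are set by hand, which is an assumption standing in for the NER component's output.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only; no statistical NER
doc = nlp("Mr. Johnson goes to Los Angeles, and Mrs. Smith went to San Francisco.")

# Assumption for illustration: mark the surnames as PERSON by hand,
# standing in for what a trained NER component would predict.
doc.ents = [
    Span(doc, tok.i, tok.i + 1, label="PERSON")
    for tok in doc
    if tok.text in ("Johnson", "Smith")
]

matcher = Matcher(nlp.vocab)
# spaCy v3 signature: matcher.add(key, list_of_patterns)
matcher.add("MALE_NAME", [[{"LOWER": {"IN": ["mr", "mr."]}}, {"ENT_TYPE": "PERSON"}]])
matcher.add("FEMALE_NAME", [[{"LOWER": {"IN": ["mrs", "mrs."]}}, {"ENT_TYPE": "PERSON"}]])

# as_spans=True returns Span objects labelled with the pattern name
spans = matcher(doc, as_spans=True)
# Replace the existing entities with the longest non-overlapping matches
doc.ents = filter_spans(spans)
labelled = [(ent.text, ent.label_) for ent in doc.ents]
print(labelled)
```

Assigning to doc.ents this way discards every entity that is not part of a match, so a real pipeline would probably merge the new spans with the untouched existing ones before assigning.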
