Add new pattern in Entity Ruler Spacy with regex in multiple tokens

I have this code, and it works well when I search for exact words.

import spacy

#nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer", "ner"])
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    # Plain strings are matched as exact phrases
    {"label": "ORG", "pattern": "Google"},
    {"label": "COLOR", "pattern": "yellow"},
    {"label": "COLOR", "pattern": "red"},
    # Token patterns: one dict per token
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    # Raw strings keep the \d escape intact
    {"label": "DIN", "pattern": [{"TEXT": {"REGEX": r"DIN\d"}}]},
    {"label": "DIAM", "pattern": [{"TEXT": {"REGEX": r"diameter\d"}}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "stainless"}, {"LOWER": "steel"}]},
    {"label": "BRAND", "pattern": [{"LOWER": "cubitron"}, {"LOWER": "ii"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Google red yellow DIN 789 opening its first big zinc plated office in San Francisco")
print([(ent.text, ent.label_) for ent in doc.ents])

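But the regex is not applied to the whole sentence, only to each individual token.

I tried something like the following to add the new entities, but the new DIN label still does not show up in the output.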

import re

from spacy.tokens import Span
from spacy.util import filter_spans

doc = nlp("Google red yellow DIN 180 opening its first big zinc plated office in San Francisco")

pattern = r"DIN\s\d"
original_ents = list(doc.ents)
mwt_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    # char_span returns None when the character offsets do not fall on token boundaries
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="DIN")
    original_ents.append(per_ent)

doc.ents = original_ents

# Drop overlapping spans before the final assignment
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
    print(ent.text, ent.label_)

What exactly am I doing wrong? How can I add a new rule to the nlp model based on a regex that searches the whole input? Thanks!

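Token patterns are matched one token at a time, and the number is a separate token, so your regex only ever sees a single token. Just add another token to your pattern: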

[{"LOWER" : "diameter"}, {"IS_DIGIT": True}]
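As a minimal sketch of how that two-token pattern could be wired into the entity ruler: the example below assumes a blank English pipeline (`spacy.blank("en")`) instead of `en_core_web_sm` so that it runs without a model download, and the sample sentence and the extra DIN pattern are made up for illustration.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) is enough to demonstrate the ruler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # One dict per token: the keyword, followed by a token made only of digits
    {"label": "DIN", "pattern": [{"TEXT": "DIN"}, {"IS_DIGIT": True}]},
    {"label": "DIAM", "pattern": [{"LOWER": "diameter"}, {"IS_DIGIT": True}]},
])

# Hypothetical sample text
doc = nlp("A DIN 789 bolt with diameter 12 in stainless steel")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

A `REGEX` operator can still be used inside either token dict, for example to restrict the number of digits, as long as each regex only has to match the text of one token.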

How can I add to the nlp model new rule based on regex that searches in the whole input?

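The Matcher does not support this. If you want to run a regex over the whole input, you can do that yourself and add the spans directly; you don't need the Matcher.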
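One way this could look is a small custom pipeline component that runs `re.finditer` over `doc.text` and converts each match to a span with `doc.char_span`. This is only a sketch: the component name `din_regex_ruler` is made up for the example, and a blank English pipeline is assumed so the code runs without a model. Note that `char_span` returns `None` when a match does not end on a token boundary, which is what happens with `r"DIN\s\d"` on "DIN 180" (the match stops inside the token "180"); using `\d+` lets the match cover the whole number.

```python
import re

import spacy
from spacy.language import Language
from spacy.util import filter_spans

# \d+ so the match ends at the end of the number token
DIN_RE = re.compile(r"DIN\s\d+")


@Language.component("din_regex_ruler")  # hypothetical component name
def din_regex_ruler(doc):
    spans = list(doc.ents)  # keep entities set by earlier components
    for match in DIN_RE.finditer(doc.text):
        # Convert character offsets to a token-aligned, labelled span
        span = doc.char_span(match.start(), match.end(), label="DIN")
        if span is not None:
            spans.append(span)
    # doc.ents must not contain overlapping spans
    doc.ents = filter_spans(spans)
    return doc


nlp = spacy.blank("en")  # assumption: no trained model needed for the demo
nlp.add_pipe("din_regex_ruler")

doc = nlp("Google red yellow DIN 180 opening its first big office in San Francisco")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)

# The pattern from the question ends mid-token, so no span can be built from it
short = re.search(r"DIN\s\d", doc.text)
misaligned = doc.char_span(short.start(), short.end())
print(misaligned)
```

In a pipeline that also contains an entity ruler or NER, the component can be added after them so that `filter_spans` resolves any overlap. If partial matches should snap to full tokens instead of being dropped, `doc.char_span` also accepts `alignment_mode="expand"`.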