Add a new pattern to the spaCy EntityRuler with a regex spanning multiple tokens
I have this code, which works well when I search for exact words.
import spacy

# nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer", "ner"])
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "ORG", "pattern": "Google"},
    {"label": "COLOR", "pattern": "yellow"},
    {"label": "COLOR", "pattern": "red"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    # Token-level regexes: each one is tested against a single token's text
    {"label": "DIN", "pattern": [{"TEXT": {"REGEX": r"DIN\d"}}]},
    {"label": "DIAM", "pattern": [{"TEXT": {"REGEX": r"diameter\d"}}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "stainless"}, {"LOWER": "steel"}]},
    {"label": "BRAND", "pattern": [{"LOWER": "cubitron"}, {"LOWER": "ii"}]},
]
ruler.add_patterns(patterns)
doc = nlp("Google red yellow DIN 789 opening its first big zinc plated ffice in San Francisco")
print([(ent.text, ent.label_) for ent in doc.ents])
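However, the regex is not applied to the whole sentence, only to each individual token.
I tried adding something like the following to create the new entity, but the new label DIN still does not appear in the output.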
import re

from spacy.tokens import Span
from spacy.util import filter_spans

doc = nlp("Google red yellow DIN 180 opening its first big zinc plated ffice in San Francisco")
pattern = r"DIN\s\d"
original_ents = list(doc.ents)
mwt_ents = []
# Run the regex over the raw text and map character offsets back to tokens
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))
# Turn every aligned match into a DIN entity span
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="DIN")
    original_ents.append(per_ent)
# Drop overlapping spans before assigning, since doc.ents rejects overlaps
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
    print(ent.text, ent.label_)
What exactly am I doing wrong? How can I add a new rule to the nlp model, based on a regex that searches the whole input?
Thanks!
Since your regex is only there to match a digit token, just add another token to your pattern.
[{"LOWER" : "diameter"}, {"IS_DIGIT": True}]
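As a minimal sketch of how that suggestion might look inside the ruler: the `DIAM` pattern below is the one given above, while the `DIN` pattern, the example sentence, and the use of `spacy.blank("en")` (chosen so that no trained model needs to be downloaded) are assumptions made for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because the
# entity ruler only needs the tokenizer.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # One dict per token: the literal token "DIN" followed by an all-digit token
    {"label": "DIN", "pattern": [{"TEXT": "DIN"}, {"IS_DIGIT": True}]},
    # The pattern from the answer: "diameter" (any casing) followed by digits
    {"label": "DIAM", "pattern": [{"LOWER": "diameter"}, {"IS_DIGIT": True}]},
])

doc = nlp("Google red yellow DIN 789 bolt with diameter 12 in San Francisco")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Each dictionary in a pattern list describes exactly one token, so a multi-token entity is expressed as several dictionaries rather than as one regex containing whitespace.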
How can I add to the nlp model new rule based on regex that searches in the whole input?
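The Matcher does not support that. If you want to run a regex against the whole input, you can do that yourself and add the spans directly; you do not need the Matcher.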
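A sketch of that do-it-yourself approach follows, using `spacy.blank("en")` as an assumed stand-in for the full pipeline. One likely reason the attempt in the question produced no DIN entity: `r"DIN\s\d"` matches only `DIN 1`, which ends in the middle of the token `180`, and `Doc.char_span` returns `None` when the character offsets do not line up with token boundaries. Using `\d+` so the match covers the whole number appears to be enough.

```python
import re

import spacy
from spacy.util import filter_spans

# Assumption: a blank pipeline stands in for the real one; only tokenization matters here.
nlp = spacy.blank("en")
doc = nlp("Google red yellow DIN 180 opening its first big zinc plated office in San Francisco")

# \d+ consumes the full number, so the match ends on a token boundary
pattern = re.compile(r"DIN\s\d+")

new_ents = []
for match in pattern.finditer(doc.text):
    start, end = match.span()
    # char_span returns None if the offsets do not align with token boundaries
    span = doc.char_span(start, end, label="DIN")
    if span is not None:
        new_ents.append(span)

# Merge with existing entities and drop overlaps before assigning
doc.ents = filter_spans(list(doc.ents) + new_ents)
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

If a regex cannot be made to end on token boundaries, `Doc.char_span` also accepts an `alignment_mode` argument (`"expand"` or `"contract"`) that snaps the offsets to the nearest tokens instead of returning `None`.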