How to make spaCy matching case-insensitive

Question

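How can I make spaCy case-insensitive?

Is there a code snippet or something else I should add? As it stands, I cannot get the entities whose capitalization differs from my patterns.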

import spacy
import pandas as pd

from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
ruler = nlp.add_pipe("entity_ruler")


flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])



result={}
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    result[ent.label_] = ent.text
df = pd.DataFrame([result])
print(df)

Answer 1

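You need to build the patterns with the LOWER token attribute. You also have to account for multi-word entities, so split each phrase into words and build the token patterns dynamically: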

import spacy

# Disable the statistical NER so that only the rule-based entities are reported
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler")

patterns = []
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    patterns.append({"label": "FLOWER", "pattern": [{'LOWER': w} for w in f.split()]})
animals = ["cat", "dog", "artic fox"]
for a in animals:
    patterns.append({"label": "ANIMAL", "pattern": [{'LOWER': w} for w in a.split()]})

ruler.add_patterns(patterns)

doc = nlp("CAT and Artic fox, plant african daisy")

# Print every matched span together with its label
print([(ent.text, ent.label_) for ent in doc.ents])

Output:

[('CAT', 'ANIMAL'), ('Artic fox', 'ANIMAL'), ('african daisy', 'FLOWER')]
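
The pattern-building step above can be factored into a small helper. This is only a sketch: `build_lower_patterns` is a hypothetical name, not part of spaCy, and it assumes that splitting on whitespace gives the same tokens as spaCy's tokenizer, which may not hold for phrases containing hyphens or punctuation.

```python
from typing import Dict, Iterable, List


def build_lower_patterns(label: str, phrases: Iterable[str]) -> List[Dict]:
    """Turn plain phrases into token patterns that compare lowercase forms.

    Hypothetical helper: assumes whitespace splitting matches the tokenizer.
    """
    patterns = []
    for phrase in phrases:
        # One {"LOWER": ...} dict per whitespace-separated word. The word is
        # lowercased here so mixed-case entries in the list still match.
        token_specs = [{"LOWER": word.lower()} for word in phrase.split()]
        patterns.append({"label": label, "pattern": token_specs})
    return patterns


patterns = build_lower_patterns("FLOWER", ["rose", "African Daisy"])
print(patterns)
```

The returned list has the same shape as the `patterns` list built in the answer, so it could be passed to `ruler.add_patterns(...)` in the same way.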

Answer 2

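As long as it is fine for every pattern to use LOWER, you can keep using phrase patterns and add the phrase_matcher_attr option to the entity ruler. Then you don't have to worry about tokenizing the phrases yourself, and if you have many patterns to match, it will also be faster than using token patterns: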

import spacy

nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])

doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    print(ent, ent.label_)

Output:

CAT animal
Artic fox animal
african daisy flower
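
If the goal is the per-label table from the question, note that `result[ent.label_] = ent.text` keeps only the last entity for each label, so one of the two animals would be lost even once the matching is case-insensitive. A minimal sketch of one way around that, assuming spaCy v3 and that a blank English pipeline is enough here (LOWER is a lexical attribute, so the pretrained model should not be needed for this ruler):

```python
from collections import defaultdict

import spacy

# Assumption: a tokenizer-only pipeline is sufficient for the entity ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

flowers = ["rose", "tulip", "african daisy"]
animals = ["cat", "dog", "artic fox"]
# Register all phrase patterns in a single call.
ruler.add_patterns(
    [{"label": "FLOWER", "pattern": f} for f in flowers]
    + [{"label": "ANIMAL", "pattern": a} for a in animals]
)

doc = nlp("CAT and Artic fox, plant african daisy")

# Collect every entity text under its label instead of overwriting.
entities_by_label = defaultdict(list)
for ent in doc.ents:
    entities_by_label[ent.label_].append(ent.text)

print(dict(entities_by_label))
```

Each label then maps to a list of matched texts, which can be turned into a DataFrame however the downstream code needs it.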
