How can I make spaCy matches case-insensitive?
How can I make spaCy case-insensitive?
Is there a code snippet or option I should add? With the code below I can't get entities whose case differs from my patterns.
import spacy
import pandas as pd
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
ruler = nlp.add_pipe("entity_ruler")
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
result={}
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    result[ent.label_]=ent.text
df = pd.DataFrame([result])
print(df)
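You need to create the patterns with the LOWER token attribute. You also need to account for multi-word entities, so split each phrase and build the token patterns dynamically: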
import spacy
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
ruler = nlp.add_pipe("entity_ruler")
patterns = []
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    # One {'LOWER': word} dict per token, so matching ignores case
    patterns.append({"label": "FLOWER", "pattern": [{'LOWER': w} for w in f.split()]})
animals = ["cat", "dog", "artic fox"]
for a in animals:
    patterns.append({"label": "ANIMAL", "pattern": [{'LOWER': w} for w in a.split()]})
# Add all patterns in a single call
ruler.add_patterns(patterns)
doc = nlp("CAT and Artic fox, plant african daisy")
print([(ent.text, ent.label_) for ent in doc.ents])
Output:
[('CAT', 'ANIMAL'), ('Artic fox', 'ANIMAL'), ('african daisy', 'FLOWER')]
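The pattern-building step above can be factored into a small helper. This is only a sketch: the name build_lower_patterns and the dict-of-lists input are assumptions made for illustration, not part of spaCy. It lowercases each word, because a LOWER pattern is compared against the lowercased token text, and it assumes every whitespace-separated word corresponds to exactly one spaCy token, which may not hold for terms containing punctuation or hyphens.

```python
def build_lower_patterns(terms_by_label):
    """Build case-insensitive token patterns for an entity ruler.

    terms_by_label maps a label to a list of phrases (hypothetical input shape).
    """
    patterns = []
    for label, terms in terms_by_label.items():
        for term in terms:
            # One {"LOWER": ...} dict per whitespace-separated word.
            # The value is lowercased so mixed-case terms still match.
            token_pattern = [{"LOWER": word.lower()} for word in term.split()]
            patterns.append({"label": label, "pattern": token_pattern})
    return patterns


patterns = build_lower_patterns({
    "FLOWER": ["rose", "tulip", "african daisy"],
    "ANIMAL": ["cat", "dog", "artic fox"],
})
print(patterns[2])
```

The resulting list has the same shape as the patterns list built in the answer, so it can be passed to ruler.add_patterns(patterns) in one call.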
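As long as it's fine for all patterns to use LOWER, you can keep using phrase patterns and add the phrase_matcher_attr option to the entity ruler. Then you don't have to worry about tokenizing the phrases yourself, and if you have a lot of patterns to match, it will also be faster than using token patterns: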
import spacy
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    print(ent, ent.label_)
Output:
CAT animal
Artic fox animal
african daisy flower
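One thing to watch once the matches come through: the question's code stores results with result[ent.label_] = ent.text, which keeps only the last entity per label, so 'CAT' would be overwritten by 'Artic fox'. A minimal sketch of collecting every match per label instead, assuming a hypothetical helper named group_by_label that takes (text, label) pairs such as [(ent.text, ent.label_) for ent in doc.ents]:

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group entity texts by label, keeping every match in document order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# The pairs below mirror the output shown above.
pairs = [("CAT", "animal"), ("Artic fox", "animal"), ("african daisy", "flower")]
grouped = group_by_label(pairs)
print(grouped)
```

The grouped dict holds a list per label, so both animal matches survive.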