Python's spaCy EntityRuler does not return any results

Question

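I want to build a spaCy model that recognizes organization names. Each organization name has 1 to 4 words, which can be title-cased or upper-cased. I added more than 3,500 such organization names: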

patterns = []
for organisation in organisations_list:
    patterns.append({"label": "ORG", "pattern": organisation.strip()})

So now I have a list of patterns that looks like this:

for p in patterns:
    print(p)

Result:

{'label': 'ORG', 'pattern': 'BLS AG'}
{'label': 'ORG', 'pattern': 'Chemins de fer du Jura'}
{'label': 'ORG', 'pattern': 'Comlux'}
{'label': 'ORG', 'pattern': 'CRH Gétaz Group'}
{'label': 'ORG', 'pattern': 'DKSH Management AG'}
{'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'}
{'label': 'ORG', 'pattern': 'Galenica'}
{'label': 'ORG', 'pattern': 'Givaudan'}
{'label': 'ORG', 'pattern': 'Heliswiss'}
{'label': 'ORG', 'pattern': 'Jet Aviation'}
{'label': 'ORG', 'pattern': 'Kolmar'}
...
...

So the patterns object looks like this:

patterns = [{'label': 'ORG', 'pattern': 'BLS AG'},
            {'label': 'ORG', 'pattern': 'Chemins de fer du Jura'},
            {'label': 'ORG', 'pattern': 'Comlux'},
            {'label': 'ORG', 'pattern': 'CRH Gétaz Group'},
            {'label': 'ORG', 'pattern': 'DKSH Management AG'},
            {'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'},
            {'label': 'ORG', 'pattern': 'Galenica'},
            {'label': 'ORG', 'pattern': 'Givaudan'},
            {'label': 'ORG', 'pattern': 'Heliswiss'},
            {'label': 'ORG', 'pattern': 'Jet Aviation'},
            {'label': 'ORG', 'pattern': 'Kolmar'}, ....]

Then I created a blank model:

nlp = spacy.blank("en")
nlp.add_pipe('entity_ruler')
ruler.add_patterns(patterns)

Then I tested it like this:

for full_text in list_of_texts:
    doc = nlp(full_text)
    print(doc.ents.text, doc.ents.label_)

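It doesn't recognize anything (even when I test it on a sentence that contains the exact name of an organization). I also tried adding a tagger and a parser to my blank model alongside the entity_ruler, but the result is always the same.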

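These are some of the texts I used for testing (every company name in the test texts is also in the patterns, with the same casing and spelling):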

t1 = "I work in company called DKSH Management AG its very good company"
t2 = "I have stayed in Holiday Inn Express and I really liked it"
t3 = "Have you head for company named AKKA Technologies SE"
t4 = "what do you think about ERYTECH Pharma"
t5 = "did you get an email from ESI Group"
t6 = "Esso S.A.F. sent me an email last week"

What am I doing wrong? I noticed that it works if I do this:

ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe('entity_ruler', before = 'tagger')
# If I do print(nlp.pipeline), I can see entity_ruler added before tagger.

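But I don't know whether it works because of my entity_ruler or because of the pretrained model. I tested it on 20 sample texts and it gave the same results with and without the entity_ruler, so I can't tell whether it works any better.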

What am I doing wrong?

Answer 1

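You aren't adding the EntityRuler correctly. You are creating an EntityRuler from scratch and adding rules to it, and then telling the pipeline to create a completely unrelated EntityRuler.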

This is the problem code:

ruler = EntityRuler(nlp)     # ruler 1
ruler.add_patterns(patterns) # ruler 1
nlp = spacy.blank("en")
nlp.add_pipe('entity_ruler') # this creates an unrelated ruler 2

This is what you should do:

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

That should work.

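In spaCy v2, the flow for creating a pipeline component was to create the object and then add it to the pipeline. In v3, the flow is to ask the pipeline to create the component and then use the returned object.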
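To make the v3 flow concrete, here is a minimal sketch, assuming spaCy v3 is installed. It checks that the object returned by add_pipe is the same component the pipeline runs, and that the patterns actually landed on it. The variable names same_object and pattern_count are only for illustration.

```python
import spacy

nlp = spacy.blank("en")

# v3 style: the pipeline creates the component and hands it back.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Jet Aviation"}])

# The returned object is the one stored in the pipeline, not a copy.
same_object = ruler is nlp.get_pipe("entity_ruler")

# len() on the ruler reports how many patterns it currently holds.
pattern_count = len(ruler)

print(nlp.pipe_names, same_object, pattern_count)
```

If pattern_count comes back as 0 for the ruler inside the pipeline, the patterns were most likely added to a different ruler object.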

Based on your updated example, here is sample code that uses the EntityRuler to match the first sentence.

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [
  {"label": "ORG", "pattern": "DKSH Management AG"},
  {"label": "ORG", "pattern": "Some other company"},
]
ruler.add_patterns(patterns)

doc = nlp("I work in company called DKSH Management AG its very good company")
print([(ent.text, ent.label_) for ent in doc.ents])
# output: [('DKSH Management AG', 'ORG')]

Does that clarify how you should structure your code?
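One more thing worth checking in the test loop from the question: doc.ents is a tuple of Span objects, so doc.ents.text would raise an AttributeError rather than print anything. A minimal sketch of a loop over several texts, assuming spaCy v3 and using three of the test sentences from the question (the found list is just an illustrative name):

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "DKSH Management AG"},
    {"label": "ORG", "pattern": "Holiday Inn Express"},
    {"label": "ORG", "pattern": "ESI Group"},
])

list_of_texts = [
    "I work in company called DKSH Management AG its very good company",
    "I have stayed in Holiday Inn Express and I really liked it",
    "did you get an email from ESI Group",
]

found = []
# nlp.pipe processes the texts as a stream and yields one Doc per text.
for doc in nlp.pipe(list_of_texts):
    # Iterate over the entity spans; each Span has .text and .label_.
    found.append([(ent.text, ent.label_) for ent in doc.ents])

print(found)
```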

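Looking at the updated question code, the code for the blank model is almost right, but note that add_pipe returns the EntityRuler object. You should add your patterns to that object.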
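As for telling whether a match came from the ruler or from a pretrained model, one option is to give the patterns an "id", which the EntityRuler copies to ent.ent_id_ on the entities it creates. The sketch below assumes spaCy v3 and uses a blank pipeline so it runs without downloading a model. The id value "from_ruler" is a made-up marker, not anything spaCy defines.

```python
import spacy

nlp = spacy.blank("en")
# With a pretrained pipeline you would typically write
# nlp.add_pipe("entity_ruler", before="ner") so that the statistical
# entity recognizer respects the spans the ruler has already set.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # "from_ruler" is a hypothetical marker chosen for this example.
    {"label": "ORG", "pattern": "ESI Group", "id": "from_ruler"},
])

doc = nlp("did you get an email from ESI Group")

# Entities produced by the ruler carry the pattern's id in ent_id_.
tagged = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(tagged)
```

In a pretrained pipeline, entities predicted by the ner component would be expected to have an empty ent_id_, which gives a simple way to separate the two sources when comparing results.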

