spaCy PhraseMatcher fails in some cases even though the POS tags are the same

Question

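The spaCy PhraseMatcher (using the LEMMA attribute) only works on some of my sentences, and its failures seem completely random. Below is a minimal working example that tries to extract the term 'colorful':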

import spacy
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("lemmatizer")

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
matcher.add('colorful', nlp('colorful'))

text1 = "this is the most colorful of the four pieces"
text2 = "colorful and bold"

text1_matches = matcher(nlp(text1))
text2_matches = matcher(nlp(text2))

# These are the results that I get
text1_matches = [(9306951126003165228, 4, 5)]
text2_matches = []

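Why does the PhraseMatcher find the term in the first example but not in the second? In both, 'colorful' has the POS tag ADJ and the lemma 'colorful'. What else could differ between the sentences that would make the PhraseMatcher find one and not the other?

What am I missing?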

Answer 1

I suggest updating spaCy:

pip install spacy --upgrade

Download your model:

python -m spacy download en_core_web_sm

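And use this code, which passes the pattern as a list of Doc objects and drops the extra nlp.add_pipe("lemmatizer") call: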

import spacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
matcher.add('colorful', [nlp('colorful')])

text1 = "this is the most colorful of the four pieces"
text2 = "colorful and bold"

text1_matches = matcher(nlp(text1))
text2_matches = matcher(nlp(text2))
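
If a match is still missing, it may help to check what the matcher actually compares. The sketch below is only an illustration, not the asker's pipeline: it assumes spaCy v3 and builds Doc objects with hand-written lemmas (via a hypothetical helper, make_doc) instead of loading en_core_web_sm, so the lemma values are fully under your control. With attr="LEMMA" the PhraseMatcher compares lemma strings exactly, so position in the sentence does not matter, but a lemma that differs in any way (here, in capitalization) does.

```python
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()


def make_doc(words, lemmas):
    # Hypothetical helper: builds a Doc with hand-assigned lemmas,
    # standing in for what a trained pipeline would normally produce.
    return Doc(vocab, words=words, lemmas=lemmas)


matcher = PhraseMatcher(vocab, attr="LEMMA")
# Patterns are passed as a list of Doc objects.
matcher.add("colorful", [make_doc(["colorful"], ["colorful"])])

doc1 = make_doc(
    ["this", "is", "the", "most", "colorful", "of", "the", "four", "pieces"],
    ["this", "be", "the", "most", "colorful", "of", "the", "four", "piece"],
)
doc2 = make_doc(["colorful", "and", "bold"], ["colorful", "and", "bold"])
# Same sentence, but the (assumed) lemma keeps the capital letter.
doc3 = make_doc(["Colorful", "and", "bold"], ["Colorful", "and", "bold"])


def spans(doc):
    # Drop the match ID and keep only the (start, end) token offsets.
    return [(start, end) for _, start, end in matcher(doc)]


for doc in (doc1, doc2, doc3):
    print([t.lemma_ for t in doc], spans(doc))
```

On a real pipeline, printing [(t.text, t.lemma_, t.pos_) for t in nlp(text)] for both the pattern and the sentence gives the same kind of side-by-side view.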
