Why does the spaCy Matcher return the right answer only when the two words are set as separate 'TEXT' conditions?

I am trying to set up a Matcher to find the phrase 'iPhone X'.


import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])


# Create a pattern with a single token description containing both words
pattern = [{"TEXT": "iPhone X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

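Why does the second approach not work? I assumed that if I put the words 'iPhone' and 'X' together, it would work the same way, because the Matcher would treat the string with a space in the middle as one long single word. But it does not.

The only reason I can think of is that a Matcher condition has to be a single word with no space in it. Am I right? Or is there another reason the second approach does not work?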


The answer lies in how spaCy tokenizes the string:

>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']

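As you can see, iPhone and X are separate tokens, so a single {"TEXT": "iPhone X"} condition can never equal the text of any one token. See the Matcher reference: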

A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.
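
To make the "one dictionary per token" rule concrete, here is a rough toy model in plain Python. It is not spaCy's actual implementation: the `toy_match` function is a hypothetical helper written only for illustration, and `str.split()` stands in for spaCy's tokenizer (which also splits punctuation, as with 'pre', '-', 'orders' above).

```python
# Toy model of token-level matching. NOT spaCy's real implementation;
# toy_match is a hypothetical helper used only to illustrate the idea.
def toy_match(tokens, pattern):
    """Return (start, end) spans where each pattern dict equals one token."""
    spans = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Each dict is compared against exactly one token, never against two.
        if all(spec["TEXT"] == tok for spec, tok in zip(pattern, window)):
            spans.append((start, start + width))
    return spans

# Assumption: whitespace splitting as a stand-in for spaCy's tokenizer.
tokens = "Upcoming iPhone X release date leaked".split()

two_tokens = toy_match(tokens, [{"TEXT": "iPhone"}, {"TEXT": "X"}])
one_token = toy_match(tokens, [{"TEXT": "iPhone X"}])

print([" ".join(tokens[s:e]) for s, e in two_tokens])  # ['iPhone X']
print(one_token)  # []
```

The two-dictionary pattern lines up with the two adjacent tokens, while the single-dictionary pattern is compared against one token at a time and no token has the text "iPhone X".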
