Does the PhraseMatcher in spaCy still work with incorrect tokenization?

https://spacy.io/usage/rule-based-matching#phrasematcher

For this example:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab)
    terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
    # Only run nlp.make_doc to speed things up
    patterns = [nlp.make_doc(text) for text in terms]
    matcher.add("TerminologyList", patterns)

    doc = nlp("He lives in Washington, D.C. and Boston.")
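Running the matcher on this doc finds the term even though it spans several tokens. Here is a minimal runnable sketch; it uses spacy.blank("en") instead of the full en_core_web_sm model, which is an assumption for self-containedness — the PhraseMatcher only needs the tokenizer and shared vocab here, not the statistical components:

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough for this demo: PhraseMatcher works
# on token attributes, so no tagger/parser/NER is required.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("He lives in Washington, D.C. and Boston.")
# Each match is a (match_id, start, end) tuple in token indices.
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

Because the pattern and the text pass through the same tokenizer, "Washington, D.C." lines up token-for-token with the doc and is found.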

The docs say:

Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.

The reason 'Washington, D.C.' can be matched in the text without worrying about tokenization is that 'Washington, D.C.' is tokenized correctly. Suppose instead the tokenization looked like this:

    ['in', 'Washington', ',', 'D.', 'C. and', 'Boston', '.']

My question is: if 'C. and' were tokenized as a single token, would the match for 'Washington, D.C.' still succeed?

As long as the start and end of the phrase fall on token boundaries, it doesn't matter how Washington, D.C. is tokenized internally. In your example it would not match, because C. and is a single token (for some unusual reason?).

So you also couldn't match Washington D., and you couldn't match D.C (without the final period) if D.C. is a single token.
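The boundary condition can be checked directly by constructing a Doc with the hypothetical tokenization from the question by hand (a sketch, assuming spacy.blank("en"); the explicit words list simulates a tokenizer that fused "C. and" into one token):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TerminologyList", [nlp.make_doc("Washington, D.C.")])

# Build a Doc with the hypothetical tokenization where "C. and" is a
# single token, so the end of "Washington, D.C." is not a token boundary.
words = ["in", "Washington", ",", "D.", "C. and", "Boston", "."]
doc = Doc(nlp.vocab, words=words)

matches = matcher(doc)
print(matches)  # empty list: the pattern's tokens never line up with "D." + "C. and"
```

With this tokenization the pattern fails at the third token: the PhraseMatcher compares token texts, and nothing in the doc matches the pattern's tokenization of "D.C.".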