Does the PhraseMatcher in spaCy still work with incorrect tokenization?

https://spacy.io/usage/rule-based-matching#phrasematcher

For this example:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab)
    terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
    # Only run nlp.make_doc to speed things up
    patterns = [nlp.make_doc(text) for text in terms]
    matcher.add("TerminologyList", patterns)

    doc = nlp("He lives in Washington, D.C. and Boston.")
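Running the matcher on this doc finds the term even though it spans several tokens. Here is a minimal runnable sketch; it uses spacy.blank("en") instead of the full en_core_web_sm model, which is an assumption for self-containedness — the PhraseMatcher only needs the tokenizer and shared vocab here, not the statistical components:

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough for this demo: PhraseMatcher works
# on token attributes, so no tagger/parser/NER is required.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("He lives in Washington, D.C. and Boston.")
# Each match is a (match_id, start, end) tuple in token indices.
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

Because the pattern and the text pass through the same tokenizer, "Washington, D.C." lines up token-for-token with the doc and is found.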

The docs say:

Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.

The reason 'Washington, D.C.' can be matched in the text without worrying about tokenization is that 'Washington, D.C.' is tokenized correctly. Suppose instead the tokenization looked like this:

    ['in', 'Washington', ',', 'D.', 'C. and', 'Boston', '.']

My question is: if 'C. and' were tokenized as a single token, would the match for 'Washington, D.C.' still succeed?

As long as the start and end of the phrase fall on token boundaries, it doesn't matter how Washington, D.C. is tokenized internally. In your example it would not match, because C. and is a single token (for some unusual reason?).

So you also couldn't match Washington D., and you couldn't match D.C (without the final period) if D.C. is a single token.
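The boundary condition can be checked directly by constructing a Doc with the hypothetical tokenization from the question by hand (a sketch, assuming spacy.blank("en"); the explicit words list simulates a tokenizer that fused "C. and" into one token):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TerminologyList", [nlp.make_doc("Washington, D.C.")])

# Build a Doc with the hypothetical tokenization where "C. and" is a
# single token, so the end of "Washington, D.C." is not a token boundary.
words = ["in", "Washington", ",", "D.", "C. and", "Boston", "."]
doc = Doc(nlp.vocab, words=words)

matches = matcher(doc)
print(matches)  # empty list: the pattern's tokens never line up with "D." + "C. and"
```

With this tokenization the pattern fails at the third token: the PhraseMatcher compares token texts, and nothing in the doc matches the pattern's tokenization of "D.C.".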