Does the PhraseMatcher in spaCy still work with incorrect tokenization?
https://spacy.io/usage/rule-based-matching#phrasematcher
For this example:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)
doc = nlp("He lives in Washington, D.C. and Boston. ")
The docs say:
Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.
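The reason 'Washington, D.C.' can be matched against the text without worrying about tokenization is that 'Washington, D.C.' happens to be tokenized correctly. Suppose the tokenization instead looked like this: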
['in', 'Washington', ',', 'D.', 'C. and', 'Boston', '.']
My question is: if 'C. and' is tokenized as a single token, will the match for 'Washington, D.C.' still succeed?
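It doesn't matter how 'Washington, D.C.' is tokenized internally, as long as the start and end of the phrase are token boundaries. In your example it would not match, because 'C. and' is a single token (for some unusual reason?).
For the same reason, you can't match 'Washing' or 'ton D.', and you can't match 'D.C' (without the final '.') if 'D.C.' is a single token.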
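The boundary rule above can be checked with a small sketch. It is not from the original answer: it assumes spaCy v3, uses spacy.blank("en") so that no model download is needed, and builds the hypothetical bad tokenization from the question by hand with the Doc constructor. The variable names are illustrative only.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

# Assumption: a blank English pipeline (tokenizer only) is enough here.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TerminologyList", [nlp.make_doc("Washington, D.C.")])

# Case 1: pattern and text go through the same tokenizer, so they agree.
good_doc = nlp.make_doc("He lives in Washington, D.C. and Boston.")
good_matches = [good_doc[start:end].text for _, start, end in matcher(good_doc)]

# Case 2: the hypothetical tokenization from the question, built by hand.
# 'C. and' is one token, so the phrase does not end on a token boundary.
words = ["He", "lives", "in", "Washington", ",", "D.", "C. and", "Boston", "."]
spaces = [True, True, True, False, True, False, True, False, False]
bad_doc = Doc(nlp.vocab, words=words, spaces=spaces)
bad_matches = [bad_doc[start:end].text for _, start, end in matcher(bad_doc)]

# Case 3: a pattern that starts and ends inside a token cannot match either.
partial_matcher = PhraseMatcher(nlp.vocab)
partial_matcher.add("Partial", [nlp.make_doc("Washing")])
partial_matches = [good_doc[s:e].text for _, s, e in partial_matcher(good_doc)]

print(good_matches)
print(bad_matches)
print(partial_matches)
```

Only the first case should produce a match, since it is the only one where the phrase starts and ends on token boundaries of the document.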