Spacy matcher fails at full stops

Question

I'm new to spacy and am trying to use the following script:

import spacy
from spacy.language import Language
from spacy.matcher import Matcher

nlp  = spacy.load('en_core_web_sm')
text = "Google announced a new Pixel at Google I/O. The Google I/O is a great place to get all the updates from Google I/O."

def add_event_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    entity = doc[start:end]
    print(entity.text, start, end)

pattern = [[
  {"TEXT": "Google"}, 
  {"TEXT": "I"}, 
  {"TEXT": "/"}, 
  {"TEXT": "O"}, 
  {"IS_PUNCT": True, "OP": "?"}
]]
matcher = Matcher(nlp.vocab)
matcher.add("Google", pattern, on_match = add_event_ent)

doc = nlp(text)
matcher(doc)

Output:

Google I/O 11 15
[(11578853341595296054, 11, 15)]

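I expected it to detect all 3 occurrences of Google I/O, but it doesn't, and I'm not entirely sure why. I've tried a few different things but nothing has worked. I think the issue is the full stop.

I wrote basically the same code snippet with a different text and pattern: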

text = "Hello, World! Hello, World! How are you?"
pattern = [[
  {"LOWER": "hello"},
  {"IS_PUNCT": True},
  {"LOWER": "world"}
]]
matcher = Matcher(nlp.vocab)
matcher.add("Google", pattern, on_match = add_event_ent)
doc = nlp(text)
matcher(doc)
for ent in doc.ents:
  print(f"[ENTITY] {ent.text:{15}} {ent.label_}")
print(doc)

Output:

Hello, World 0 3
Hello, World 4 7
Hello, World! Hello, World! How are you?

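As you can see, this one does work.

I made this viz for the first example in case it helps, and this one, which shows that it isn't working, but again I'm not sure why.

Any help is appreciated. Let me know if I can provide any more information!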

Answer 1

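The problem comes from tokenization: when I/O is followed by a full stop, the last token is O. (with the . character attached at the end of the token text), so {"TEXT": "O"} does not match it.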
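One way to check this yourself is to print the tokens. This is only a minimal sketch: it uses spacy.blank("en") instead of en_core_web_sm (my substitution, on the assumption that both apply the same English tokenizer rules), and exact token boundaries may differ between spaCy versions.

```python
import spacy

# Assumption: a blank English pipeline uses the same tokenizer rules as
# en_core_web_sm, so the trained model is not needed just to inspect tokens.
nlp = spacy.blank("en")
text = "Google announced a new Pixel at Google I/O. The Google I/O is a great place to get all the updates from Google I/O."

doc = nlp(text)
token_texts = [token.text for token in doc]

# Print every token with its index, so the start/end numbers reported by the
# matcher callback can be compared against the actual token boundaries.
for token in doc:
    print(token.i, repr(token.text))
```

Look at the tokens that follow each "/": where the sentence ends right after I/O, the token should show up as 'O.' rather than 'O'.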

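Instead of defining an optional punctuation token in the pattern, you can match any O token with an optional trailing punctuation character. You can use a regex for this: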

pattern = [[
  {"TEXT": "Google"}, 
  {"TEXT": "I"}, 
  {"TEXT": "/"}, 
  {"TEXT": {"REGEX": r"^O(?:_|[^\w\s])?$"}}
]]

Output:

Google I/O. 6 10
Google I/O 11 15
Google I/O. 25 29

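Here, {"TEXT": {"REGEX": r"^O(?:_|[^\w\s])?$"}} matches a token of one or two characters that starts with O and is optionally followed by a single punctuation character.

^ - start of the token (of the string, generally)
O - an O character
(?:_|[^\w\s])? - either _ or (|) any character other than a word or whitespace character ([^\w\s] is a negated character class; \w matches letters, digits and underscores, \s matches whitespace), one or zero times (due to the ? quantifier)
$ - end of the token (of the string, generally)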
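The breakdown above can be checked without spaCy by running the same regex through Python's re module. This is a rough sketch: the candidate strings are made up for illustration, and it assumes the matcher applies the regex to the token text in the same way re.search does.

```python
import re

# Same regex as in the pattern above.
TOKEN_RE = re.compile(r"^O(?:_|[^\w\s])?$")

# Hypothetical token texts, chosen only to exercise each part of the regex.
candidates = ["O", "O.", "O_", "O!", "Oh", "O..", "o.", "O "]

# Map each candidate to whether the regex accepts it.
results = {text: bool(TOKEN_RE.search(text)) for text in candidates}

for text, matched in results.items():
    print(f"{text!r:6} -> {matched}")
```

Only "O", "O.", "O_" and "O!" are accepted. "Oh" fails because h is a word character, "O.." has two trailing characters, "o." is lowercase, and "O " ends in whitespace.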

python

nlp

spacy