spaCy Matcher date pattern will match hyphens, but not forward slashes?

Question

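I can't find any reason why pattern_2 works in the code below but pattern_1 does not. Why is the matcher able to find the date pattern with hyphens, but not the one with forward slashes?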

import spacy
from spacy.tokens.doc import Doc
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)  
doc: Doc = nlp('4/15/2021 4-15-2021')

pattern_1 = [{'IS_DIGIT': True}, {'ORTH': '/'}, {'IS_DIGIT': True}, {'ORTH': '/'}, {'IS_DIGIT': True}]
pattern_2 = [{'IS_DIGIT': True}, {'ORTH': '-'}, {'IS_DIGIT': True}, {'ORTH': '-'}, {'IS_DIGIT': True}]

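# Patterns are passed as a list, which works in spaCy v2.2.2+ and v3;
# v3 no longer accepts the older matcher.add(name, None, pattern) form.
matcher.add('DATE_PATTERN_1', [pattern_1])
matcher.add('DATE_PATTERN_2', [pattern_2])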
matches = matcher(doc)
print(f"matches = {matches}")

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

Output:

matches = [(93485516188963487, 1, 6)]

4-15-2021

Answer 1

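The first date, 4/15/2021, is parsed as a single token, while 4-15-2021 is split into five tokens. pattern_1 expects five separate tokens, so it can never match the slash-separated date: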

print([t for t in doc])
# => [4/15/2021, 4, -, 15, -, 2021]

You can use a regex-based pattern to detect this kind of token:

pattern_1 = [{'TEXT':{'REGEX':r'^\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?$'}}]

The results will then be:

print(f"matches = {matches}")
# => matches = [(2279607876847626059, 0, 1), (93485516188963487, 1, 6)]
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

# => 4/15/2021
#    4-15-2021
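
For reference, a minimal end-to-end sketch combining the regex pattern with the original hyphen pattern might look like the following. It assumes spaCy v3 and uses spacy.blank('en') instead of en_core_web_sm (an assumption made here so that no model download is needed; it should apply the same default English tokenizer rules). The names SLASH_DATE and HYPHEN_DATE are illustrative, not from the original post.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, since only the tokenizer matters here.
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
doc = nlp('4/15/2021 4-15-2021')

# The slash date is one token, so match it with a regex on the token text.
slash_date = [{'TEXT': {'REGEX': r'^\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?$'}}]
# The hyphen date is split into five tokens, so match it token by token.
hyphen_date = [{'IS_DIGIT': True}, {'ORTH': '-'}, {'IS_DIGIT': True},
               {'ORTH': '-'}, {'IS_DIGIT': True}]

matcher.add('SLASH_DATE', [slash_date])
matcher.add('HYPHEN_DATE', [hyphen_date])

# Sort by start offset so the output order does not depend on matcher internals.
matches = sorted(matcher(doc), key=lambda m: m[1])
texts = [doc[start:end].text for _, start, end in matches]
print(texts)
```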

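The regex matches:

^ - start of string
\d{1,2} - one or two digits
/ - a / character
\d{1,2}/\d{2} - one or two digits, /, two digits
(?:\d{2})? - an optional sequence of two digits
$ - end of string (here, the end of the token).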
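
The breakdown above can be checked without spaCy at all. The sketch below applies the same regex to a few strings with Python's re module; the sample strings and the name DATE_TOKEN are made up for illustration.

```python
import re

# Same regex as in the Matcher pattern, compiled once for reuse.
DATE_TOKEN = re.compile(r'^\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?$')

# Hypothetical sample tokens: two valid dates and three that should be rejected.
samples = ['4/15/2021', '04/15/21', '4/15/202', '4-15-2021', '2021/4/15']

results = {s: bool(DATE_TOKEN.match(s)) for s in samples}
for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

Only the first two samples match: a three-digit year fails because the year must be exactly two or four digits, hyphens are not accepted as separators, and a leading four-digit year exceeds the one-or-two-digit limit of the first group.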
