Spacy 3.0 Matcher remove overlaps and preserve the information for the pattern used

Question

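Is there a shorter, cleaner, or built-in way to remove overlapping results from the Matcher while keeping the value of the pattern that produced each match, so that you can tell which pattern gave which result? The matcher results originally include the pattern ID, but the solutions I have seen for eliminating overlaps drop the ID.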

Here is the solution I am currently using. It works, but it is a bit long:

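import spacy
from spacy.matcher import Matcher

# Assumption: the original snippet never defined `nlp`. The patterns below rely on
# POS tags, so a pipeline with a tagger is needed; en_core_web_sm is one such choice.
nlp = spacy.load("en_core_web_sm")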

text ="United States vs Canada, Canada vs United States, United States vs United Kingdom, Mark Jefferson vs College, Clown vs Jack Cadwell Jr., South America Snakes vs Lopp, United States of America, People vs Jack Spicer"

doc = nlp(text)

# Matcher
matcher = Matcher(nlp.vocab)
# Two patterns
pattern1 = [{"POS": "PROPN", "OP": "+", "IS_TITLE": True}, {"TEXT": {"REGEX": "vs$"}}, {"POS": "PROPN", "OP": "+", "IS_TITLE": True}]
pattern2 = [{"POS": "ADP"}, {"POS": "PROPN", "IS_TITLE": True}]
matcher.add("Games", [pattern1])
matcher.add("States", [pattern2])

# Output is a list of tuples: (pattern name ID, match start, match end)
matches = matcher(doc)

First, I store the results in a dictionary, with the pattern name as the key and a list of tuples as the value:

result = {}
for match_id, start, end in matches:
    # Look up the pattern name from its hash and collect the (start, end) offsets under it
    result.setdefault(nlp.vocab.strings[match_id], []).append((start, end))
print(result)

This prints:

{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)],

 'Games': [(1, 4), (0, 4), (5, 8), (5, 9), (11, 14), (10, 14), (11, 15), (10, 15), (17, 20),
  (16, 20), (21, 24), (21, 25), (21, 26), (38, 41), (38, 42)]}

Then I iterate over the results, use filter_spans to remove the overlaps, and append each start and end as a tuple:

for key, value in result.items():
    new_vals = [doc[start:end] for start, end in value]
    val2 =[]
    for span in spacy.util.filter_spans(new_vals):
        val2.append((span.start, span.end))
    result[key]=val2

print(result)

This prints the results with the overlaps removed:

{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)], 

'Games': [(0, 4), (5, 9), (10, 15), (16, 20), (21, 26), (38, 42)]}
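
To make the filtering step easier to follow, here is a minimal sketch of what filter_spans does with a few of the overlapping "Games" offsets from above: it prefers the longest span and, when lengths are equal, the one that starts first. The Doc is built by hand from pre-split words (an assumption made so the sketch runs without a trained pipeline), and the names `words` and `candidates` are illustrative only.

```python
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

# Hand-built Doc covering the first ten tokens of the example text
words = "United States vs Canada , Canada vs United States ,".split()
doc = Doc(Vocab(), words=words)

# A few of the overlapping (start, end) offsets reported for "Games"
candidates = [(1, 4), (0, 4), (5, 8), (5, 9)]
spans = [doc[start:end] for start, end in candidates]

# filter_spans keeps the longest non-overlapping spans, sorted by start
kept = [(span.start, span.end) for span in filter_spans(spans)]
print(kept)  # [(0, 4), (5, 9)]
```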

To get the text values, just loop over each pattern's results and print the spans:

print ("---Games---")
for start, end in result['Games']:
    span =doc[start:end] 
    print (span.text)

print (" ")

print ("---States---")
for start, end in result['States']:
    span =doc[start:end] 
    print (span.text)

Output:

---Games---
United States vs Canada
Canada vs United States
United States vs United Kingdom
Mark Jefferson vs College
Clown vs Jack Cadwell Jr.
People vs Jack Spicer
 
---States---
vs Canada
vs United
vs United
vs College
vs Jack
vs Lopp
of America
vs Jack

Answer 1

In your processing, you can create new spans that keep the label instead of using doc[start:end], which does not include the label:

from spacy.tokens import Span
span = Span(doc, start, end, label=match_id)
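
As a rough end-to-end sketch of that idea, the snippet below builds a labeled Span for every match and then filters the overlaps, so each surviving span still carries its pattern name. It assumes a blank English pipeline and two toy patterns ("A" and "B") chosen only so that it runs without a trained model.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("B", [[{"ORTH": "b"}]])

doc = nlp("a a a a b")

# Keep the match ID as the span label instead of discarding it
labeled_spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

# After filtering, label_ still tells us which pattern produced each span
kept = [(span.label_, span.start, span.end) for span in filter_spans(labeled_spans)]
print(kept)  # [('A', 0, 4), ('B', 4, 5)]
```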

Easier than this, in spaCy v3.0+, is to use the matcher option as_spans:

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("B", [[{"ORTH": "b"}]])

matched_spans = matcher(nlp("a a a a b"), as_spans=True)
for span in spacy.util.filter_spans(matched_spans):
    print(span.label_, ":", span.text)
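
One difference from the code in the question is worth noting: calling filter_spans on all matches at once also removes overlaps between different patterns, whereas the question filters each pattern separately. If the per-pattern behaviour is what you want, a possible sketch is to group the spans by label first. The toy patterns "A" and "AB" and the blank pipeline are assumptions chosen so the difference is visible without a trained model.

```python
from collections import defaultdict

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("AB", [[{"ORTH": "a"}, {"ORTH": "b"}]])

doc = nlp("a a a a b")
matched_spans = matcher(doc, as_spans=True)

# Filtering everything together drops "AB" because it overlaps the longer "A" span
merged = [(s.label_, s.start, s.end) for s in filter_spans(matched_spans)]

# Filtering per pattern keeps the best spans of each pattern independently
spans_by_label = defaultdict(list)
for span in matched_spans:
    spans_by_label[span.label_].append(span)
result = {
    label: [(s.start, s.end) for s in filter_spans(spans)]
    for label, spans in spans_by_label.items()
}

print(merged)  # [('A', 0, 4)]
print(result)
```

The `result` dictionary has the same shape as the one built in the question, with pattern names as keys and lists of (start, end) tuples as values.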
