Matcher is returning some duplicate entries

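I want the output to be ["good customer service", "great ambience"], but I am getting ["good customer", "good customer service", "great ambience"], because the pattern also matches "good customer", and that phrase makes no sense on its own. How can I remove these duplicates?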

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")
matcher = Matcher(nlp.vocab)

# Create a pattern matching an adjective followed by one or more nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]

# spaCy v2 signature; in spaCy v3 this is matcher.add("ADJ_NOUN_PATTERN", [pattern])
matcher.add("ADJ_NOUN_PATTERN", None, pattern)

matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

You can post-process the matches by grouping the tuples by their start index and keeping only the tuple with the largest end index:

from itertools import groupby

#...

matches = matcher(doc)
results = [max(list(group), key=lambda x: x[2]) for key, group in groupby(matches, lambda prop: prop[1])]
print("Matches:", [doc[start:end].text for match_id, start, end in results])
# => Matches: ['good customer service', 'great ambience']

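groupby(matches, lambda prop: prop[1]) groups the matches by start index, here giving [(5488211386492616699, 0, 2), (5488211386492616699, 0, 3)] and (5488211386492616699, 4, 6). max(list(group), key=lambda x: x[2]) then picks the item whose end index (the third value) is the largest. Note that groupby only merges adjacent items, so this relies on matches that share a start index being next to each other in the list.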
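The grouping step can be tried without loading a model. The following is a minimal sketch on plain tuples shaped like the Matcher output above; the helper name longest_per_start is hypothetical, and the explicit sort is an added precaution in case the matches do not arrive ordered by start index.

```python
from itertools import groupby

# (match_id, start, end) tuples, copied from the example above
matches = [
    (5488211386492616699, 0, 2),  # "good customer"
    (5488211386492616699, 0, 3),  # "good customer service"
    (5488211386492616699, 4, 6),  # "great ambience"
]

def longest_per_start(matches):
    """Keep, for each start index, the match with the largest end index."""
    # groupby only merges adjacent items, so sort by start index first
    ordered = sorted(matches, key=lambda m: m[1])
    return [
        max(group, key=lambda m: m[2])
        for _, group in groupby(ordered, key=lambda m: m[1])
    ]

results = longest_per_start(matches)
print(results)
# => [(5488211386492616699, 0, 3), (5488211386492616699, 4, 6)]
```

This only removes matches that share a start index. Overlapping matches that begin at different tokens would all survive.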

spaCy has a built-in function that does this. Have a look at filter_spans:

The documentation says:

When spans overlap, the (first) longest span is preferred over shorter spans.

Example:

from spacy.util import filter_spans

doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
filtered = filter_spans(spans)
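
To use this on the Matcher output from the question, the matches would first be turned into spans, for example [doc[start:end] for match_id, start, end in matches], and that list passed to filter_spans. To show what the "longest span is preferred" rule means in practice, here is a pure-Python sketch on plain (match_id, start, end) tuples. It is not spaCy's implementation, and keep_longest is a hypothetical helper written only for illustration.

```python
def keep_longest(matches):
    """Drop every match that overlaps a longer (or equally long, earlier) match."""
    # Longest matches first; ties are broken by the earliest start index
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept, seen = [], set()
    for match_id, start, end in ordered:
        # Keep the match only if none of its tokens is already covered
        if not any(i in seen for i in range(start, end)):
            kept.append((match_id, start, end))
            seen.update(range(start, end))
    # Return the survivors in document order
    return sorted(kept, key=lambda m: m[1])

matches = [
    (5488211386492616699, 0, 2),  # "good customer"
    (5488211386492616699, 0, 3),  # "good customer service"
    (5488211386492616699, 4, 6),  # "great ambience"
]
filtered_matches = keep_longest(matches)
print(filtered_matches)
# => [(5488211386492616699, 0, 3), (5488211386492616699, 4, 6)]
```

Unlike grouping by start index, this rule also discards a shorter match that starts inside a longer one.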