Matcher is returning some duplicate entries
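I want the output to be ["good customer service", "great ambience"], but I am getting ["good customer", "good customer service", "great ambience"], because the pattern also matches "good customer", a phrase that makes no sense here. How can I remove these duplicates?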
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")
matcher = Matcher(nlp.vocab)
# Pattern: an adjective followed by one or more nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]
# spaCy v2 signature; in spaCy v3 this is matcher.add("ADJ_NOUN_PATTERN", [pattern])
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
You can post-process the matches by grouping the tuples by their start index and keeping only the tuple with the largest end index:
from itertools import groupby
#...
matches = matcher(doc)
results = [max(list(group),key=lambda x: x[2]) for key, group in groupby(matches, lambda prop: prop[1])]
print("Matches:", [doc[start:end].text for match_id, start, end in results])
# => Matches: ['good customer service', 'great ambience']
groupby(matches, lambda prop: prop[1]) groups the matches by start index, which here gives [(5488211386492616699, 0, 2), (5488211386492616699, 0, 3)] and (5488211386492616699, 4, 6). max(list(group), key=lambda x: x[2]) then picks the item with the largest end index (the third value in each tuple).
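One caveat worth illustrating: itertools.groupby only merges items that sit next to each other, so the approach above relies on matches with the same start index being adjacent. The sketch below is a minimal, spaCy-free illustration that sorts first to make this explicit. The helper name longest_per_start is hypothetical, and the tuples are the ones quoted above, deliberately shuffled.

```python
from itertools import groupby

# (match_id, start, end) tuples as quoted above, deliberately out of order.
matches = [
    (5488211386492616699, 4, 6),
    (5488211386492616699, 0, 3),
    (5488211386492616699, 0, 2),
]

def longest_per_start(matches):
    """Keep, for each start index, the match with the largest end index."""
    # groupby only groups adjacent items, so sort by start index first.
    by_start = sorted(matches, key=lambda m: m[1])
    return [
        max(group, key=lambda m: m[2])
        for _, group in groupby(by_start, key=lambda m: m[1])
    ]

results = longest_per_start(matches)
print(results)
```

With a real doc, the texts would then be recovered as in the answer, via doc[start:end].text for each tuple in results.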
spaCy has a built-in function that does this. Take a look at filter_spans.
The documentation says:
When spans overlap, the (first) longest span is preferred over shorter spans.
Example:
from spacy.util import filter_spans

doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
filtered = filter_spans(spans)
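To use filter_spans on matcher output, the tuples first have to be turned into spans, for example [doc[start:end] for match_id, start, end in matches]. To show what "longest span wins" means without needing a model, here is a rough plain-Python sketch of the same idea on (match_id, start, end) tuples. It is an approximation for illustration, not spaCy's implementation; keep_longest and the sample tuples (including the extra (1, 1, 3) entry) are hypothetical.

```python
def keep_longest(matches):
    """Greedy longest-first selection over (match_id, start, end) tuples."""
    # Longest match first; ties go to the earlier start index.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    taken = set()  # token indices already covered by a kept match
    kept = []
    for match_id, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])

# Hypothetical matches; (1, 1, 3) overlaps (1, 0, 3) but has a different start.
matches = [(1, 0, 2), (1, 0, 3), (1, 1, 3), (1, 4, 6)]
filtered_matches = keep_longest(matches)
print(filtered_matches)
```

Unlike grouping by start index, this also drops overlapping matches that begin at different tokens, which is the main reason to prefer the overlap-based filter.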