spaCy Matcher - Only Match Longest String
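I'm trying to create noun chunks using the spaCy pattern matcher. For example, given the sentence "The ice hockey scrimmage took hours.", I want to return "ice hockey scrimmage" and "hours".
I currently have this: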
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 the call is matcher.add("NounChunks", [pattern])
matcher.add("NounChunks", None, [{"POS": "NOUN"}, {"POS": "NOUN", "OP": "*"}, {"POS": "NOUN", "OP": "*"}] )
doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # look up the pattern name
    span = doc[start:end]  # the matched span
    print(match_id, string_id, start, end, span.text)
But it returns every variant of "ice hockey scrimmage", not just the longest one:
12482938965902279598 NounChunks 1 2 ice
12482938965902279598 NounChunks 1 3 ice hockey
12482938965902279598 NounChunks 2 3 hockey
12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 2 4 hockey scrimmage
12482938965902279598 NounChunks 3 4 scrimmage
12482938965902279598 NounChunks 5 6 hours
Am I missing something in how I define the pattern? I only want it to return:
12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 5 6 hours
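I'm not aware of any built-in way to keep only the longest spans, but there is a utility function, spacy.util.filter_spans(spans), that helps with this. It picks the longest spans among those given, and if several overlapping spans have the same length, it gives priority to the one that appears first in the list of spans.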
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
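# One or more consecutive nouns (spaCy v2 signature; in spaCy v3 use matcher.add("NounChunks", [pattern]))
matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}] )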
doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
print(spacy.util.filter_spans(spans))
Output:
[ice hockey scrimmage, hours]
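To make the selection rule described above more concrete, here is a minimal pure-Python sketch of a "keep the longest non-overlapping spans" filter. It is not spaCy's implementation: the function name filter_longest is hypothetical, spans are plain (start, end) token offsets rather than Span objects, and breaking ties in favour of the earlier start offset is an assumption of this sketch.

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    # Longest spans first; among equal lengths, the earlier start wins
    # (assumed tie-break for this sketch).
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Skip any span that shares a token with a span we already kept.
        if any(i in seen_tokens for i in range(start, end)):
            continue
        kept.append((start, end))
        seen_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


# The (start, end) offsets printed in the question's output.
matches = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4), (5, 6)]
print(filter_longest(matches))
```

Applied to the offsets from the question, only (1, 4) and (5, 6) survive, which correspond to "ice hockey scrimmage" and "hours".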