Is there a rule-based matching method in spaCy to match these patterns?

I want to use rule-based matching. I have texts in which every word is followed by its POS tag, like this:

 text1= "it_PRON is_AUX a_DET beautiful_ADJ  apple_NOUN"

 text2= "it_PRON is_AUX a_DET beautiful_ADJ and_CCONJ big_ADJ apple_NOUN"

So I want to create a rule-based match that fires when there is an ADJ followed by a NOUN, or an ADJ followed by (PUNCT or CCONJ), followed by an ADJ, followed by a NOUN.

So I would like the output to be:

text1 = [beautiful_ADJ  apple_NOUN]
text2= [beautiful_ADJ and_CCONJ big_ADJ apple_NOUN]

I tried to do this, but I couldn't find the right pattern that makes it work:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matchers = {"first_processing": Matcher(nlp.vocab, validate=True)}

pattern = [{}, {}, {}]  #################################### we must find the right pattern
matchers["first_processing"].add("process_1", None, pattern)

doc = nlp("it_PRON is_AUX a_DET beautiful_ADJ and_CCONJ big_ADJ apple_NOUN")
matches = matchers["first_processing"](doc)
for match_id, start, end in matches:
    text = doc[start:end].text
    print(text)

I don't know spaCy, but here is a solution with re (the standard-library module):

import re

# one ADJ, optionally followed by (CCONJ|PUNCT)-ADJ pairs, ending in a NOUN;
# \S+ (rather than \w+) before _CCONJ/_PUNCT so that tokens such as ",_PUNCT" also match
REGEX = re.compile(r"\w+_ADJ +(?:\S+(?:_CCONJ|_PUNCT) +\w+_ADJ +)*\w+_NOUN")

def extract(s):
    try:
        # expect exactly one phrase; zero or several matches raise ValueError
        [extracted] = REGEX.findall(s)
    except ValueError:
        return []
    else:
        return extracted.split()

>>> extract("it_PRON is_AUX a_DET beautiful_ADJ and_CCONJ big_ADJ apple_NOUN")
['beautiful_ADJ', 'and_CCONJ', 'big_ADJ', 'apple_NOUN']

>>> extract("it_PRON is_AUX a_DET beautiful_ADJ apple_NOUN")
['beautiful_ADJ', 'apple_NOUN']
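
Since [extracted] = REGEX.findall(s) unpacks a single-element list, extract returns [] both when there is no match and when the text contains more than one such phrase. If several phrases per text are possible, a small variant based on finditer returns all of them (the name extract_all is just for illustration, reusing the REGEX defined above):

def extract_all(s):
    # return every matching phrase, each split into its tagged words
    return [m.group(0).split() for m in REGEX.finditer(s)]

>>> extract_all("a_DET small_ADJ pear_NOUN and_CCONJ a_DET big_ADJ apple_NOUN")
[['small_ADJ', 'pear_NOUN'], ['big_ADJ', 'apple_NOUN']]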

I understand that you have texts = ["it is a beautiful apple", "it is a beautiful and big apple"] and plan to define several Matcher patterns to extract certain POS sequences from those texts.

You can define a list of lists with the patterns you need and pass them as the third and subsequent arguments to matcher.add:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)
patterns = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'ADJ'}, {'POS': 'CCONJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'ADJ'}, {'POS': 'PUNCT'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}]
]
matcher.add("process_1", None, *patterns)  # spaCy v2 signature; in v3: matcher.add("process_1", patterns)

texts= ["it is a beautiful apple", "it is a beautiful and big apple"]
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for _, start, end in matches:
        print(doc[start:end].text)
   
# => beautiful apple
#    beautiful and big apple
#    big apple
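
If you prefer a single pattern instead of three, the same idea can be written with the OP and IN operators of the token-pattern syntax, making the middle "(CCONJ or PUNCT) ADJ" part optional. This is only a sketch, and it is slightly broader than the three explicit patterns because each optional token can be skipped independently (it would also accept ADJ ADJ NOUN, for example):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)

# ADJ, optionally followed by (CCONJ or PUNCT) and a second ADJ, then NOUN
combined = [
    {'POS': 'ADJ'},
    {'POS': {'IN': ['CCONJ', 'PUNCT']}, 'OP': '?'},
    {'POS': 'ADJ', 'OP': '?'},
    {'POS': 'NOUN'},
]
matcher.add("process_1", None, combined)  # spaCy v2; in v3: matcher.add("process_1", [combined])

for text in ["it is a beautiful apple", "it is a beautiful and big apple"]:
    doc = nlp(text)
    for _, start, end in matcher(doc):
        print(doc[start:end].text)

As before, the shorter overlapping span ("big apple") is still reported; if you only want the longest match, the resulting spans can be filtered with spacy.util.filter_spans.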