How to write a spaCy matcher with a POS regex
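spaCy has two features I'd like to combine: part-of-speech (POS) tagging and rule-based matching.
How can I combine them neatly?
For example, given a sentence as input, I'd like to verify that it meets some POS ordering condition, for example that a verb comes after a noun (something like a noun**verb regex). The result should be true or false. Is that doable? Or is the matcher specific, as in the example?
Can rule-based matching have POS rules?
If not, here is my current plan: gather everything into one string and apply a regex.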
import spacy
# 'en' is the shortcut name from spaCy 1.x/2.x; spaCy 3 needs a named model such as 'en_core_web_sm'
nlp = spacy.load('en')
#doc = nlp(u'is there any way you can do it')
text = u'what are the main issues'
doc = nlp(text)
concatPos = ''
print(text)
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    concatPos += word.text + "_" + word.tag_ + "_" + word.pos_ + "-"
print('-----------')
print(concatPos)
print('-----------')
# output of string- what_WP_NOUN-are_VBP_VERB-the_DT_DET-main_JJ_ADJ-issues_NNS_NOUN-
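For the true/false check described in the question, a minimal sketch of the "one string plus regex" plan might look like the following. It works on a plain list of coarse POS tags, so it needs no model; the helper names `pos_string` and `noun_then_verb` and the hand-written tag lists are illustrative assumptions, not part of spaCy.

```python
import re

def pos_string(pos_tags):
    """Join coarse POS tags into one string, e.g. '<DET><NOUN><VERB>'."""
    return "".join("<" + tag + ">" for tag in pos_tags)

def noun_then_verb(pos_tags):
    """Return True if some NOUN is followed, anywhere later, by a VERB."""
    # (<[A-Z]+>)* skips any number of whole tags between the noun and the verb
    pattern = re.compile(r"<NOUN>(<[A-Z]+>)*<VERB>")
    return pattern.search(pos_string(pos_tags)) is not None

# Hypothetical tag sequences, written by hand rather than produced by a tagger
print(noun_then_verb(["DET", "NOUN", "VERB", "ADV"]))   # True
print(noun_then_verb(["VERB", "DET", "ADJ", "NOUN"]))   # False
```

With a tagged spaCy document, the same check would presumably be called as noun_then_verb([t.pos_ for t in doc]). Wrapping each tag in angle brackets keeps the regex from matching part of a tag name.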
Sure, simply use the POS attribute.
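import spacy
from spacy.matcher import Matcher
from spacy.attrs import POS
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# add_pattern is the spaCy 1.x call; later releases replaced it with matcher.add
matcher.add_pattern("Adjective and noun", [{POS: 'ADJ'}, {POS: 'NOUN'}])
doc = nlp(u'what are the main issues')
# each match describes one adjective followed by a noun
matches = matcher(doc)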
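To turn the matcher result into the true/false answer the question asks for, something like the sketch below might work. It assumes spaCy 3, where patterns are registered with matcher.add(key, list_of_patterns). To stay independent of any downloaded model it builds the Doc by hand with POS tags I picked myself, so the tags are an assumption rather than tagger output.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank pipeline: tokenizer and vocab only, no trained model required
nlp = spacy.blank("en")

# Assumption: POS tags are supplied by hand; a trained pipeline would predict them
words = ["what", "are", "the", "main", "issues"]
pos = ["PRON", "AUX", "DET", "ADJ", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# spaCy 3 signature: add(key, list_of_patterns)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

# Each match is (match_id, start, end) with token offsets into doc
matched_spans = [doc[start:end].text for _, start, end in matcher(doc)]
has_match = len(matched_spans) > 0

print(matched_spans)
print(has_match)
```

With a real pipeline, doc = nlp(text) would replace the hand-built Doc and the rest should stay the same.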
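Eyal Shulman's answer was helpful, but it makes you hard-code a pattern for the matcher rather than use a real regular expression.
I wanted to use regular expressions, so I wrote my own solution: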
import re

pattern = r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*'
## the sentence containing doc[start:end]; doc, start and end are assumed to come from the surrounding code
sentence = doc[start:end].sent
## create a string with the pos of the sentence
posString = ""
for w in sentence:
    posString += "<" + w.pos_ + ">"
lstVerb = []
for m in re.compile(pattern).finditer(posString):
    ## each m is a verb phrase match, starting at character offset m.start()
    ## count the "<" in m to find how many tokens we want
    numTokensInGroup = m.group().count('<')
    ## then find the number of tokens that came before that group.
    numTokensBeforeGroup = posString[:m.start()].count('<')
    verbPhrase = sentence[numTokensBeforeGroup:numTokensBeforeGroup+numTokensInGroup]
    lstVerb.append(verbPhrase)
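The offset arithmetic above can be tried without a model. Here is a rough standalone sketch that takes a plain list of POS tags and returns token index ranges; the function name `find_pos_spans` and the sample tags are made up for illustration.

```python
import re

def find_pos_spans(pos_tags, pattern):
    """Return (start, end) token index pairs for each regex match over the tags."""
    pos_string = "".join("<" + tag + ">" for tag in pos_tags)
    spans = []
    for m in re.finditer(pattern, pos_string):
        if m.end() == m.start():
            continue  # ignore empty matches, which cover no tokens
        # every token contributes exactly one "<", so counting them gives token offsets
        start = pos_string[:m.start()].count("<")
        end = start + m.group().count("<")
        spans.append((start, end))
    return spans

verb_phrase = r"(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*"
# Hypothetical tags, written by hand rather than produced by a tagger
tags = ["PRON", "VERB", "PART", "VERB", "DET", "NOUN"]
print(find_pos_spans(tags, verb_phrase))   # [(1, 4)]
```

Given a spaCy sentence, the tags would come from [w.pos_ for w in sentence], and sentence[start:end] would give the matching span for each pair.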