句子结构识别 - spacy

Sentence Structure identification - spacy

我打算使用 spacy 和 textacy 来识别英语的句子结构。

例如: 猫坐在垫子上——SVO,猫跳起来捡起饼干——SVV0。 猫吃了饼干和饼干。 - 嘘。

该程序应该阅读一个段落,return 每个句子的输出为 SVO、SVOO、SVVO 或其他自定义结构。

目前的努力:

# -*- coding: utf-8 -*-
#!/usr/bin/env python
from __future__ import unicode_literals
# Load Library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"] 
VERB = ["ROOT"] 
OBJ = ["dobj", "pobj", "dobj"] 
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print ("SVO:", text_ext)

输出:

(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])

编辑 1:

我正在概念化的一些方法。

from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'I will go to the mall.'
doc = nlp(sentence)
chk_set = set(['PRP','MD','NN'])
result = chk_set.issubset(t.tag_ for t in doc)
if result == False:
    print "SVO not identified"
elif result == True: # shouldn't do this
    print "SVO"
else:
    print "Others..."

编辑 2:

取得进一步进展

from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
doc = nlp(sentence)
print(" ".join([token.dep_ for token in doc]))

当前输出:

det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct

预期输出:

SVO SVVO SVOO

想法是将依存标签分解为简单的主谓和宾语模型。

如果没有其他选项可用,考虑使用正则表达式实现它。但这是我最后的选择。

编辑 3:

经过学习this link,有所进步。

def testSVOs():
    nlp = en_core_web_sm.load()
    tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
    svos = findSVOs(tok)
    print(svos)

当前输出:

[(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]

预期输出:

我期待句子的符号。尽管我能够提取关于如何将其转换为 SVO 符号的 SVO。它更多的是模式识别而不是句子内容本身。

SVO SVO SVOO

Issue 1: The SVO are overwritten. Why?

这是 textacy 问题。这部分效果不是很好,请看这个 blog

Issue 2: How to identify the sentence as SVOO SVO SVVO etc.?

您应该解析依赖关系树。 SpaCy 提供了信息,您只需要编写一组规则将其提取出来,使用 .head.left.right.children 属性。

>>for word in text: 
    print('%10s %5s %10s %10s %s'%(word.text, word.tag_, word.dep_, word.pos_, word.head.text_))

        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN sat 
        sat   VBD       ROOT       VERB sat 
         on    IN       prep        ADP sat 
        the    DT        det        DET mat
        mat    NN       pobj       NOUN on 
          .     .      punct      PUNCT sat 
         of    IN       ROOT        ADP of 
        the    DT        det        DET lab
        art    NN   compound       NOUN lab
        lab    NN       pobj       NOUN of 
          .     .      punct      PUNCT of 
        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN jumped 
     jumped   VBD       ROOT       VERB jumped 
        and    CC         cc      CCONJ jumped 
     picked   VBD       conj       VERB jumped 
         up    RP        prt       PART picked 
        the    DT        det        DET biscuit
    biscuit    NN       dobj       NOUN picked 
          .     .      punct      PUNCT jumped 
        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN ate 
        ate   VBD       ROOT       VERB ate 
    biscuit    NN       dobj       NOUN ate 
        and    CC         cc      CCONJ biscuit 
    cookies   NNS       conj       NOUN biscuit 
          .     .      punct      PUNCT ate 

我建议你看看这个 code,只需将 pobj 添加到 OBJECTS 的列表中,你就会得到你的 SVO 和 SVOO。稍作调整,您也可以获得 SVVO。