Sentence structure identification - spaCy
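I plan to use spaCy and textacy to identify sentence structure in English.
For example:
The cat sat on the mat - SVO, The cat jumped and picked up the biscuit - SVVO.
The cat ate biscuit and cookies. - SVOO.
The program should read a paragraph and return the output for each sentence as SVO, SVOO, SVVO, or another custom structure.
Efforts so far: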
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load Library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"]
VERB = ["ROOT"]
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print ("SVO:", text_ext)
Output:
(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])
- Issue 1: The SVO are overwritten. Why?
- Issue 2: How to identify the sentence as SVOO, SVO, SVVO, etc.?
Edit 1:
Some approaches I was conceptualizing.
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'I will go to the mall.'
doc = nlp(sentence)
chk_set = set(['PRP','MD','NN'])
result = chk_set.issubset(t.tag_ for t in doc)
if result == False:
    print("SVO not identified")
elif result == True: # shouldn't do this
    print("SVO")
else:
    print("Others...")
Edit 2:
Made further progress.
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
doc = nlp(sentence)
print(" ".join([token.dep_ for token in doc]))
Current output:
det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct
Expected output:
SVO SVVO SVOO
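The idea is to break the dependency tags down into a simple subject-verb-object model.
If no other option is available, I am considering implementing this with regular expressions, but that is my last resort.
Edit 3:
After studying this link, I made some progress.
def testSVOs():
    nlp = en_core_web_sm.load()
    tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
    svos = findSVOs(tok)
    print(svos)
Current output: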
[(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]
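Expected output:
I am expecting a notation for each sentence. I am able to extract the SVO triples, but I do not know how to convert them into SVO notation. This is more about pattern identification than about the sentence content itself.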
SVO SVO SVOO
Issue 1: The SVO are overwritten. Why?
This is a textacy issue. That part does not work very well; see this blog.
Issue 2: How to identify the sentence as SVOO SVO SVVO etc.?
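You should parse the dependency tree. spaCy provides the information; you just need to write a set of rules to extract it, using the .head, .lefts, .rights, and .children attributes.
>>> for word in text:
...     print('%10s %5s %10s %10s %s' % (word.text, word.tag_, word.dep_, word.pos_, word.head.text))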
The DT det DET cat
cat NN nsubj NOUN sat
sat VBD ROOT VERB sat
on IN prep ADP sat
the DT det DET mat
mat NN pobj NOUN on
. . punct PUNCT sat
of IN ROOT ADP of
the DT det DET lab
art NN compound NOUN lab
lab NN pobj NOUN of
. . punct PUNCT of
The DT det DET cat
cat NN nsubj NOUN jumped
jumped VBD ROOT VERB jumped
and CC cc CCONJ jumped
picked VBD conj VERB jumped
up RP prt PART picked
the DT det DET biscuit
biscuit NN dobj NOUN picked
. . punct PUNCT jumped
The DT det DET cat
cat NN nsubj NOUN ate
ate VBD ROOT VERB ate
biscuit NN dobj NOUN ate
and CC cc CCONJ biscuit
cookies NNS conj NOUN biscuit
. . punct PUNCT ate
I suggest you look at this code: just add pobj to the OBJECTS list and you will get your SVO and SVOO. With a little tweaking, you can get SVVO as well.
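To make the rule-based idea concrete, here is a minimal sketch (it is not the linked code). It maps each token to S, V, or O from its dependency label and lets a conj token inherit the letter of its head, which is what separates "jumped and picked" (two verbs) from "biscuit and cookies" (two objects). The names sentence_pattern, letter_for, and patterns_for_doc are hypothetical, and treating every ROOT as a verb is an assumption that fails for fragments such as the "of the art lab" row above.

```python
# Dependency labels treated as subject / object, following the lists in the question.
SUBJ = {"nsubj", "nsubjpass"}
OBJ = {"dobj", "pobj"}


def letter_for(index, tokens):
    """Return 'S', 'V', 'O' or '' for the token at `index`.

    `tokens` is a list of (dep_label, head_index) pairs for ONE sentence,
    with head_index relative to the start of that sentence.
    """
    dep, head = tokens[index]
    if dep in SUBJ:
        return "S"
    if dep == "ROOT":
        return "V"  # assumption: the root of the sentence is its main verb
    if dep in OBJ:
        return "O"
    if dep == "conj" and head != index:
        # A conjunct plays the same role as the token it is coordinated with.
        return letter_for(head, tokens)
    return ""


def sentence_pattern(tokens):
    """Concatenate the role letters in word order, e.g. 'SVVO'."""
    return "".join(letter_for(i, tokens) for i in range(len(tokens)))


def patterns_for_doc(doc):
    """Hypothetical helper: build one pattern per sentence of a parsed spaCy Doc."""
    patterns = []
    for sent in doc.sents:
        tokens = [(tok.dep_, tok.head.i - sent.start) for tok in sent]
        patterns.append(sentence_pattern(tokens))
    return patterns


if __name__ == "__main__":
    # (dep, head index) pairs copied from the parse table above.
    sat = [("det", 1), ("nsubj", 2), ("ROOT", 2), ("prep", 2),
           ("det", 5), ("pobj", 3), ("punct", 2)]
    jumped = [("det", 1), ("nsubj", 2), ("ROOT", 2), ("cc", 2), ("conj", 2),
              ("prt", 4), ("det", 7), ("dobj", 4), ("punct", 2)]
    ate = [("det", 1), ("nsubj", 2), ("ROOT", 2), ("dobj", 2),
           ("cc", 3), ("conj", 3), ("punct", 2)]
    print(" ".join(sentence_pattern(s) for s in (sat, jumped, ate)))
```

With a loaded model you would call patterns_for_doc(nlp(paragraph)) instead of building the pairs by hand. Further labels (for example iobj, or auxiliaries) can be added to the sets as your custom structures require.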