How to split an NLP parse tree into clauses (independent and subordinate)?

Given an NLP parse tree like

(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))

where the original sentence is "You could say that they regularly catch a shower, which adds to their exhilaration and joie de vivre."

How can the clauses be extracted and turned back into plain text? We would split at S and SBAR (to preserve the type of clause, e.g. subordinate)

 - (S (NP (PRP You)) (VP (MD could) (VP (VB say) 
 - (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower))
 - (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))

to arrive at

 - You could say
 - that they regularly catch a shower 
 - , which adds to their exhilaration and joie de vivre.

Splitting at S and SBAR seems easy. The hard part seems to be stripping all the POS tags and chunk labels from the fragments.

You can use Tree.subtrees(). For more information, see the NLTK Tree class.

Code:

from nltk import Tree

parse_str = "(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))"
#parse_str = "(ROOT (S (SBAR (IN Though) (S (NP (PRP he)) (VP (VBD was) (ADJP (RB very) (JJ rich))))) (, ,) (NP (PRP he)) (VP (VBD was) (ADVP (RB still)) (ADJP (RB very) (JJ unhappy))) (. .)))"

t = Tree.fromstring(parse_str)
# print(t)

# Collect the full text of every S and SBAR subtree (pre-order, outermost first).
subtexts = []
for subtree in t.subtrees():
    if subtree.label() in ("S", "SBAR"):
        # print(subtree.leaves())
        subtexts.append(' '.join(subtree.leaves()))
# print(subtexts)

presubtexts = subtexts[:]       # ADDED IN EDIT for leftover check

# Working from the innermost clause outward, cut each text off where the next (nested) one begins.
for i in reversed(range(len(subtexts)-1)):
    subtexts[i] = subtexts[i][0:subtexts[i].index(subtexts[i+1])]

for text in subtexts:
    print(text)

# ADDED IN EDIT - Not sure for generalized cases
leftover = presubtexts[0][presubtexts[0].index(presubtexts[1])+len(presubtexts[1]):]
print(leftover)

Output:

You could say 
that 
they regularly catch a shower , 
which 
adds to their exhilaration and joie de vivre
 .
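
The tag stripping here is done by NLTK's leaves(), which needs a balanced tree. If all you have are the bracketed fragments from the question (which are not balanced on their own), a regular expression can do the stripping instead. This is only a minimal sketch, assuming every token appears as a `(TAG word)` pair with a single space between tag and word; `strip_tags` and `LEAF` are hypothetical names, not part of NLTK.

```python
import re

# Matches one leaf pair such as "(PRP You)" or "(, ,)": a tag, a space, a token.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def strip_tags(fragment):
    """Return the words of a bracketed fragment, dropping every label."""
    return " ".join(word for _tag, word in LEAF.findall(fragment))

# The three fragments from the question, split at S and SBAR.
fragments = [
    "(S (NP (PRP You)) (VP (MD could) (VP (VB say)",
    "(SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) "
    "(VP (VB catch) (NP (NP (DT a) (NN shower))",
    "(, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) "
    "(NP (NP (PRP$ their) (NN exhilaration)) (CC and) "
    "(NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))",
]

for fragment in fragments:
    print(strip_tags(fragment))
# You could say
# that they regularly catch a shower
# , which adds to their exhilaration and joie de vivre .
```

Tokens are joined with single spaces, so punctuation stays separated from the preceding word ("vivre ." rather than "vivre.").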

First, get the parse tree:

# stanza.install_corenlp()

from stanza.server import CoreNLPClient

text = "Joe realized that the train was late while he waited at the train station"

with CoreNLPClient(
        annotators=['tokenize', 'pos', 'lemma', 'parse', 'depparse'],
        output_format="json",
        timeout=30000,
        memory='16G') as client:
    output = client.annotate(text)
    # print(output.sentence[0])
    parse_tree = output['sentences'][0]['parse']
    parse_tree = ' '.join(parse_tree.split())

Then use this gist to extract the clauses by calling:

print_clauses(parse_str=parse_tree)

The output will be:

{'the train was late', 'he waited at the train station', 'Joe realized'}
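
The gist itself is not reproduced here. As a rough idea of how a function like print_clauses could arrive at such a result (a sketch under assumptions, not the gist's actual code): treat every S node as one clause and collect its words, skipping nested S/SBAR subtrees and punctuation. The helpers `parse`, `clause_words` and `clauses` are hypothetical names, and the small parser is only a stand-in for nltk's Tree.fromstring so that the example runs without dependencies. It is applied to the parse tree from the question:

```python
import re

def parse(s):
    """Parse a well-formed bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a leaf word
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    return node()

def clause_words(tree):
    """Words under `tree`, skipping nested S/SBAR subtrees and punctuation."""
    label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            if label[0].isalpha():        # tags such as "," and "." are punctuation
                words.append(child)
        elif child[0] not in ("S", "SBAR"):
            words.extend(clause_words(child))
    return words

def clauses(tree):
    """Yield the text of every S node, outermost first."""
    label, children = tree
    if label == "S":
        text = " ".join(clause_words(tree))
        if text:
            yield text
    for child in children:
        if not isinstance(child, str):
            yield from clauses(child)

PARSE = (
    "(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) "
    "(SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) "
    "(VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) "
    "(SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) "
    "(NP (NP (PRP$ their) (NN exhilaration)) (CC and) "
    "(NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))"
)

for clause in clauses(parse(PARSE)):
    print(clause)
# You could say
# they regularly catch a shower
# adds to their exhilaration and joie de vivre
```

Because SBAR subtrees are skipped while collecting words and only their inner S is emitted, subordinators such as "that" and "which" are dropped, which is consistent with the output shown above ("the train was late" rather than "that the train was late"). Whether this matches the gist in other cases depends on the parse CoreNLP returns.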