How to get the nodes of an NLTK tree without their grammatical form?

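I managed to write a class that builds a tree from spaCy output, and I would like to keep only the word in each node instead of the whole grammatical form. That is, from start_VB_ROOT I want to get: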

start

To summarize, taking the sentence "When did Beyonce start becoming popular?" as an example, the input is

[Tree('start_VB_ROOT', ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj', Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']), '?_._punct'])]

and the expected output, using the class I provide below, would be a tree:

<class 'str'> When_WRB_advmod
son creation : When
<class 'str'> did_VBD_aux
son creation : did
<class 'str'> Beyonce_NNP_nsubj
son creation : Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular
end of sub tree creation
<class 'str'> ?_._punct
son creation ?

Here is the class:

from nltk import Tree

class WordTree:
    '''Tree for spaCy dependency parsing array'''
    def __init__(self, array, parent = None):
        """
        Construct a new 'WordTree' object.

        :param array: The array containing the dependency
        :param parent: The parent of the array if exists
        :return: returns nothing
        """
        self.parent = []
        self.children = []
        self.data = array

        for element in array[0]:
            print(type(element),element)
            # we check if we got a subtree
            if type(element) is Tree:
                print("sub tree creation")
                self.children.append(element.label())
                print("son:",element.label())
                t = WordTree([element],element.label()) # should I verify if parent is empty ?
                print("end of sub tree creation")
            # else if we have a string we create a son
            elif type(element) is str:
                print("son creation",element)
                self.children.append(element)
            # in other case we have a problem
            else:
                print("issue?")
                break

At the moment it gives the following output:

<class 'str'> When_WRB_advmod
son creation When_WRB_advmod
<class 'str'> did_VBD_aux
son creation did_VBD_aux
<class 'str'> Beyonce_NNP_nsubj
son creation Beyonce_NNP_nsubj
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular_JJ_acomp
end of sub tree creation
<class 'str'> ?_._punct
son creation ?_._punct

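First, note that the SpaCy "grammatical forms" in the question are actually surface tokens with a POS tag and a dependency label appended. In that case, you should simply retrieve Tree.leaves() and Tree.label() from the Tree object in nltk.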
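As a minimal sketch of what those two calls return on the tree from the question (the variable names here are only illustrative): note that `leaves()` yields only the leaf strings, so words such as `start` and `becoming` that serve as subtree labels have to be read through `label()`.

```python
from nltk import Tree

# The tree from the question.
parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

# Label of the root node.
root_label = parse.label()

# All leaf strings, in left-to-right order.
leaves = parse.leaves()

# Labels of every subtree, the root included.
subtree_labels = [t.label() for t in parse.subtrees()]

print(root_label)
print(leaves)
print(subtree_labels)
```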

But it is easier to work with the raw output of the SpaCy parser than to untangle the data format used in the question.
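A rough sketch of that idea, assuming the tree is built from objects that expose `.text` and `.children` the way spaCy's `Token` does. The helper names `to_word_tree` and `fake_token` are hypothetical, and the stand-in tokens exist only so the example runs without downloading a spaCy model.

```python
from types import SimpleNamespace
from nltk import Tree

def to_word_tree(token):
    """Build an nltk Tree from a dependency token, keeping only the word."""
    children = [to_word_tree(child) for child in token.children]
    # A token with dependents becomes a subtree; one without becomes a leaf.
    return Tree(token.text, children) if children else token.text

def fake_token(text, children=()):
    # Hypothetical stand-in for a spaCy Token (only .text and .children).
    return SimpleNamespace(text=text, children=list(children))

root = fake_token('start', [
    fake_token('When'),
    fake_token('did'),
    fake_token('Beyonce'),
    fake_token('becoming', [fake_token('popular')]),
    fake_token('?'),
])

word_tree = to_word_tree(root)
print(word_tree)
```

With a real pipeline, the root token of each sentence (`sent.root` for each `sent` in `doc.sents`) could be passed to the same function, so the POS and dependency suffixes never get attached in the first place.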

Before continuing, consider using plain recursion (without a class) to do the depth-first traversal.

For future readers: please read the comments on the question before continuing with the answer below.


If you simply want to strip the POS tags and dependency labels from the leaves and labels, try this:

from nltk import Tree

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

def traverse_tree(tree):
    for subtree in tree:
        print(type(subtree), subtree)
        if isinstance(subtree, Tree):
            # Recurse into the subtree (depth-first).
            print('sub tree creation')
            traverse_tree(subtree)
            print('end of sub tree creation')
        elif isinstance(subtree, str):
            # Keep only the word, i.e. the part before the first underscore.
            surface_form = subtree.split('_')[0]
            print('son creation:', surface_form)

traverse_tree(parse)

[out]:

<class 'str'> When_WRB_advmod
son creation: When
<class 'str'> did_VBD_aux
son creation: did
<class 'str'> Beyonce_NNP_nsubj
son creation: Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
<class 'str'> popular_JJ_acomp
son creation: popular
end of sub tree creation
<class 'str'> ?_._punct
son creation: ?
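If an actual tree is wanted rather than printed lines, the same recursion can return a new Tree. This is only a sketch: the function name `strip_tags` is hypothetical, and it assumes every node has exactly the form `word_POS_dep`. It splits from the right so that a word containing an underscore itself would stay intact.

```python
from nltk import Tree

def strip_tags(node):
    """Return a copy of `node` in which every label and leaf keeps only the word."""
    if isinstance(node, Tree):
        # Strip the label, then rebuild the children recursively.
        word = node.label().rsplit('_', 2)[0]
        return Tree(word, [strip_tags(child) for child in node])
    # Leaf string: drop the trailing POS tag and dependency label.
    return node.rsplit('_', 2)[0]

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

words_only = strip_tags(parse)
print(words_only)
```

The original `parse` is left unchanged, because a new Tree is built at every level.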