How to get the nodes of an nltk tree without their grammatical form?
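I managed to write a class that builds a tree from spaCy output, and I would like to keep only the words in the nodes rather than the whole grammatical annotation. That is, go from start_VB_ROOT to start.
To summarize, for the sentence "When did Beyonce start becoming popular?", the input is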
[Tree('start_VB_ROOT', ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj', Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']), '?_._punct'])]
and the expected output, using the class I provide below, would be a tree:
<class 'str'> When_WRB_advmod
son creation : When
<class 'str'> did_VBD_aux
son creation : did
<class 'str'> Beyonce_NNP_nsubj
son creation : Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular
end of sub tree creation
<class 'str'> ?_._punct
son creation ?
Here is the class:
from nltk import Tree

class WordTree:
    '''Tree for spaCy dependency parsing array'''
    def __init__(self, array, parent = None):
        """
        Construct a new 'WordTree' object.
        :param array: The array containing the dependency
        :param parent: The parent of the array if exists
        :return: returns nothing
        """
        self.parent = []
        self.children = []
        self.data = array
        for element in array[0]:
            print(type(element),element)
            # we check if we got a subtree
            if type(element) is Tree:
                print("sub tree creation")
                self.children.append(element.label())
                print("son:",element.label())
                t = WordTree([element],element.label()) # should I verify if parent is empty ?
                print("end of sub tree creation")
            # else if we have a string we create a son
            elif type(element) is str:
                print("son creation",element)
                self.children.append(element)
            # in other case we have a problem
            else:
                print("issue?")
                break
At the moment it gives the following output:
<class 'str'> When_WRB_advmod
son creation When_WRB_advmod
<class 'str'> did_VBD_aux
son creation did_VBD_aux
<class 'str'> Beyonce_NNP_nsubj
son creation Beyonce_NNP_nsubj
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular_JJ_acomp
end of sub tree creation
<class 'str'> ?_._punct
son creation ?_._punct
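First, note that the SpaCy "grammatical forms" in the question are really surface tokens with the POS tag and the dependency label appended. In that case, you only need to read the Tree.leaves() and Tree.label() values of the nltk tree.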
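As a minimal sketch of that idea, the snippet below reads the leaves and the subtree labels of the tree from the question and keeps only the part before the first underscore. The helper name `surface` is made up for illustration, and the approach assumes the word itself never contains an underscore.

```python
from nltk import Tree

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

def surface(token):
    # Hypothetical helper: keep the text before the first underscore.
    # Assumes the word itself contains no underscore.
    return token.split('_')[0]

# Leaves are the plain-string children at any depth.
words = [surface(leaf) for leaf in parse.leaves()]

# Labels belong to Tree nodes; subtrees() yields the root first, then nested trees.
labels = [surface(t.label()) for t in parse.subtrees()]

print(words)
print(labels)
```

Note that words that head a subtree (here start and becoming) live in the labels, not in the leaves, so both lists are needed to recover every word.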
That said, it is easier to work with the raw output of the SpaCy parser than to untangle the data format used in the question.
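One way this could look, as a sketch only: build the nltk tree straight from the parsed tokens and store just the token text, so there are no tags to strip afterwards. The function name `to_word_tree` is hypothetical. It relies only on each node having `.text` and `.children`, which spaCy `Token` objects provide. The model name in the commented usage is an assumption about what is installed.

```python
from nltk import Tree

def to_word_tree(token):
    # Recursively convert a dependency node into an nltk Tree of plain words.
    # `token` is anything with `.text` and `.children` (e.g. a spaCy Token).
    children = [to_word_tree(child) for child in token.children]
    if children:
        return Tree(token.text, children)
    # A token without dependents becomes a plain string leaf.
    return token.text

# Usage with spaCy (assumes the en_core_web_sm model is installed):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# doc = nlp("When did Beyonce start becoming popular?")
# trees = [to_word_tree(sent.root) for sent in doc.sents]
```

Because the words are taken from `token.text`, this avoids the string splitting entirely, and the tags stay available on the tokens if they are needed later.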
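When doing a depth-first traversal, consider using plain recursion (without a class).
For future readers: please read the comments on the question before continuing with the answer below.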
If you simply want to strip the POS and dependency tags from the leaves and labels, try this:
from nltk import Tree

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp',
                   ['popular_JJ_acomp']),
              '?_._punct']
             )

def traverse_tree(tree, is_subtree=False):
    for subtree in tree:
        print(type(subtree), subtree)
        if isinstance(subtree, Tree):
            # Recurse into the subtree (depth-first).
            print('sub tree creation')
            traverse_tree(subtree, True)
            print('end of sub tree creation')
        elif isinstance(subtree, str):
            # Keep only the surface form, i.e. the text before the first underscore.
            surface_form = subtree.split('_')[0]
            print('son creation:', surface_form)

traverse_tree(parse)
[out]:
<class 'str'> When_WRB_advmod
son creation: When
<class 'str'> did_VBD_aux
son creation: did
<class 'str'> Beyonce_NNP_nsubj
son creation: Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
<class 'str'> popular_JJ_acomp
son creation: popular
end of sub tree creation
<class 'str'> ?_._punct
son creation: ?
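The traversal above only prints the stripped words. If what you want is an actual tree whose nodes hold just the words, the same recursion can return a new Tree instead of printing. This is a sketch under the same assumption as before (the word is everything before the first underscore), and `strip_tags` is a name chosen here for illustration.

```python
from nltk import Tree

def strip_tags(node):
    # Return a copy of `node` with POS and dependency tags removed.
    if isinstance(node, Tree):
        # Clean the label, then rebuild the children recursively.
        return Tree(node.label().split('_')[0],
                    [strip_tags(child) for child in node])
    # Leaf: a plain string such as 'When_WRB_advmod'.
    return node.split('_')[0]

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

clean = strip_tags(parse)
print(clean)
```

The original `parse` is left untouched, so the tagged and untagged trees can be used side by side.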