Any good, better, or more direct way to get the chunking result from an NLTK Tree?
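I want to chunk a sentence into groups taken at a specific tree height. The original order must be preserved, and every original word must be included.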
import nltk
height = 2
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern)
result = NPChunker.parse(sentence)
In [29]: nltk.Tree.fromstring(str(result)).pretty_print()
                            S
           _________________|_____________________________
          NP                         VBD      IN        NP
   ________|_________________         |        |    _____|____
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT cat/NN
My approach is somewhat brute-force:
In [30]: [list(map(lambda x: x[0], _tree.leaves())) for _tree in result.subtrees(lambda x: x.height()==height)]
Out[30]: [['the', 'little', 'yellow', 'dog'], ['barked'], ['at'], ['the', 'cat']]
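I would expect there to be a direct API, or something similar, that I could use for this kind of chunking. Any suggestions are much appreciated.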
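No, NLTK has no built-in function that returns the subtrees at a given depth.
But you can traverse the tree depth-first yourself.
For efficiency, iterate depth-first and recurse only while a subtree's height is within the required depth, e.g.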
import nltk
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern)
result = NPChunker.parse(sentence)
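def traverse_tree(tree, depth=float('inf')):
    """
    Traverse the Tree depth-first and yield the leaves
    of every subtree that is at most `depth` levels high.
    """
    for subtree in tree:
        # Leaves (tuples or strings) are skipped; only Tree nodes are chunks.
        if isinstance(subtree, nltk.tree.Tree):
            if subtree.height() <= depth:
                yield subtree.leaves()
                # Delegate to the recursive generator; a bare call
                # would create the generator without running it.
                yield from traverse_tree(subtree, depth)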
list(traverse_tree(result, 2))
[out]:
[[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')],
[('barked', 'VBD')],
[('at', 'IN')],
[('the', 'DT'), ('cat', 'NN')]]
Another example, where the first NP is taller than 2 and is therefore skipped:
x = """(S
(NP the/DT
(AP little/JJ yellow/JJ)
dog/NN)
(VBD barked/VBD)
(IN at/IN)
(NP the/DT cat/NN))"""
list(traverse_tree(nltk.Tree.fromstring(x), 2))
[out]:
[['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]
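The last output drops `the/DT`, `little/JJ`, `yellow/JJ` and `dog/NN`, while the question asks for every original word to be kept in order. A minimal sketch of one way to get that is shown here. The name `chunks_at_height` is hypothetical and is not part of NLTK, and emitting each stray leaf as its own single-item group is an assumption about the desired behaviour rather than something the question specifies.

```python
from nltk import Tree


def chunks_at_height(tree, height=2):
    """Yield groups of leaves in sentence order, covering every leaf once.

    A subtree whose height() is <= `height` is emitted as one group.
    Taller subtrees are descended into, and a bare leaf met on the way
    is emitted as a single-item group so that no word is dropped.
    """
    for child in tree:
        if not isinstance(child, Tree):
            yield [child]                               # stray leaf outside any chunk
        elif child.height() <= height:
            yield child.leaves()                        # whole chunk fits the limit
        else:
            yield from chunks_at_height(child, height)  # too tall: go deeper


nested = Tree.fromstring("""(S
  (NP the/DT (AP little/JJ yellow/JJ) dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))""")

groups = list(chunks_at_height(nested, 2))
print(groups)
# [['the/DT'], ['little/JJ', 'yellow/JJ'], ['dog/NN'],
#  ['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]
```

Flattening `groups` gives back `nested.leaves()` unchanged, so order and completeness are both preserved. On the flat tree from `RegexpParser` it yields the same four groups as `traverse_tree(result, 2)`.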