How can I pretty-print an nltk Tree object?
I want to visually check whether the result below is what I need:
import nltk
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern)
result = NPChunker.parse(sentence)
I don't understand why I can't call pretty_print() on result:
result.pretty_print()
The error says TypeError: not all arguments converted during string formatting. I am using Python 3.5 and nltk 3.3.
If you are looking for bracketed parse output, you can use Tree.pprint():
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>>
>>> pattern = """NP: {<DT>?<JJ>*<NN>}
... VBD: {<VBD>}
... IN: {<IN>}"""
>>> NPChunker = nltk.RegexpParser(pattern)
>>> result = NPChunker.parse(sentence)
>>> result.pprint()
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))
But most likely you are looking for this:
                             S
            _________________|_____________________________
           NP                        VBD       IN         NP
   ________|_________________         |        |     _____|____
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT   cat/NN
Let's dig into the code of Tree.pretty_print() at https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L692:
def pretty_print(self, sentence=None, highlight=(), stream=None, **kwargs):
    """
    Pretty-print this tree as ASCII or Unicode art.
    For explanation of the arguments, see the documentation for
    `nltk.treeprettyprinter.TreePrettyPrinter`.
    """
    from nltk.treeprettyprinter import TreePrettyPrinter
    print(TreePrettyPrinter(self, sentence, highlight).text(**kwargs),
          file=stream)
It creates a TreePrettyPrinter object, https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L50:
class TreePrettyPrinter(object):
    def __init__(self, tree, sentence=None, highlight=()):
        if sentence is None:
            leaves = tree.leaves()
            if (leaves and not any(len(a) == 0 for a in tree.subtrees())
                    and all(isinstance(a, int) for a in leaves)):
                sentence = [str(a) for a in leaves]
            else:
                # this deals with empty nodes (frontier non-terminals)
                # and multiple/mixed terminals under non-terminals.
                tree = tree.copy(True)
                sentence = []
                for a in tree.subtrees():
                    if len(a) == 0:
                        a.append(len(sentence))
                        sentence.append(None)
                    elif any(not isinstance(b, Tree) for b in a):
                        for n, b in enumerate(a):
                            if not isinstance(b, Tree):
                                a[n] = len(sentence)
                                sentence.append('%s' % b)
        self.nodes, self.coords, self.edges, self.highlight = self.nodecoords(
            tree, sentence, highlight)
The line that raises the error appears to be sentence.append('%s' % b) (https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L97).
The question is why it raises this TypeError:
TypeError: not all arguments converted during string formatting
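To see the failure without nltk, the leaf-indexing step can be imitated on plain nested lists. This is only a simplified stand-in: the hypothetical helper index_leaves uses lists in place of Tree objects and recursion in place of tree.subtrees(), but it formats each leaf with the same '%s' % leaf expression.

```python
def index_leaves(node, sentence):
    """Swap each leaf of a nested-list tree for its position in `sentence`."""
    for n, child in enumerate(node):
        if isinstance(child, list):      # a list plays the role of a subtree
            index_leaves(child, sentence)
        else:                            # anything else is treated as a leaf
            node[n] = len(sentence)
            sentence.append('%s' % child)


# String leaves: every leaf is formatted and replaced by an index.
words = []
tree = [['the', 'cat'], ['barked']]
index_leaves(tree, words)
print(tree)    # [[0, 1], [2]]
print(words)   # ['the', 'cat', 'barked']

# Tuple leaves: formatting the first (word, tag) pair raises TypeError.
tagged = [[('the', 'DT'), ('cat', 'NN')], [('barked', 'VBD')]]
try:
    index_leaves(tagged, [])
    error = None
except TypeError as err:
    error = str(err)
print(error)
```

With string leaves the loop finishes normally; with (word, tag) leaves it stops at the very first leaf.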
Looking closer, print('%s' % b) works for the most basic Python types:
# String
>>> x = 'abc'
>>> type(x)
<class 'str'>
>>> print('%s' % x)
abc
# Integer
>>> x = 123
>>> type(x)
<class 'int'>
>>> print('%s' % x)
123
# Float
>>> x = 1.23
>>> type(x)
<class 'float'>
>>> print('%s' % x)
1.23
# Boolean
>>> x = True
>>> type(x)
<class 'bool'>
>>> print('%s' % x)
True
Surprisingly, it even works on a list!
>>> x = ['abc', 'def']
>>> type(x)
<class 'list'>
>>> print('%s' % x)
['abc', 'def']
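But it trips over a tuple!! The % operator treats a tuple on its right-hand side as the sequence of format arguments, so a two-item tuple supplies two values for a single %s: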
>>> x = ('abc', 'def')
>>> type(x)
<class 'tuple'>
>>> print('%s' % x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: not all arguments converted during string formatting
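As a small illustration of that behaviour, here is a sketch in plain Python (no nltk involved) of the failing call next to two ways of formatting the same tuple that do succeed. The variable names are only for illustration.

```python
pair = ('abc', 'def')

# A bare tuple is unpacked into separate format arguments, so two values
# compete for one %s and Python raises TypeError.
try:
    '%s' % pair
    message = None
except TypeError as err:
    message = str(err)

# Wrapping the value in a one-item tuple formats the whole tuple as one object.
wrapped = '%s' % (pair,)

# Two placeholders consume the two items of the tuple.
joined = '%s/%s' % pair

print(message)   # not all arguments converted during string formatting
print(wrapped)   # ('abc', 'def')
print(joined)    # abc/def
```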
So if we go back to the code at https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L95:
if not isinstance(b, Tree):
    a[n] = len(sentence)
    sentence.append('%s' % b)
Since we know that sentence.append('%s' % b) cannot handle a tuple, adding a check for the tuple type and joining the items of the tuple into a str gives a nice pretty_print:
if not isinstance(b, Tree):
    a[n] = len(sentence)
    if isinstance(b, tuple):
        b = '/'.join(b)
    sentence.append('%s' % b)
[Output]:
                             S
            _________________|_____________________________
           NP                        VBD       IN         NP
   ________|_________________         |        |     _____|____
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT   cat/NN
Is it still possible to get the pretty print without changing the nltk code?
Let's see what result, i.e. the Tree object, looks like:
Tree('S', [Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]), Tree('VBD', [('barked', 'VBD')]), Tree('IN', [('at', 'IN')]), Tree('NP', [('the', 'DT'), ('cat', 'NN')])])
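It looks like the leaves are stored as a list of tuples of strings, e.g. [('the', 'DT'), ('cat', 'NN')], so we could turn them into a list of strings, e.g. ['the/DT', 'cat/NN'], so that Tree.pretty_print() plays nicely.
Since we know that Tree.pprint() joins the tuples of strings into the form we want, i.e.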
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))
we can simply write the tree out as a bracketed parse string and then read it back into a Tree object with Tree.fromstring():
from nltk import Tree
Tree.fromstring(str(result)).pretty_print()
Finally:
import nltk
from nltk import Tree
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern)
result = NPChunker.parse(sentence)
Tree.fromstring(str(result)).pretty_print()
[Output]:
                             S
            _________________|_____________________________
           NP                        VBD       IN         NP
   ________|_________________         |        |     _____|____
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT   cat/NN
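An alternative to the string round trip is to rebuild the tree with the tuple leaves already joined, which may matter when a token itself contains brackets or whitespace. The sketch below is not part of nltk: join_tagged_leaves is a hypothetical helper, written against the parts of the nltk Tree interface it needs (a label() method, iteration over children, and a constructor taking a label and a list of children), and it assumes that anything with a label attribute is a subtree.

```python
def join_tagged_leaves(tree, sep="/"):
    """Return a copy of `tree` whose tuple leaves are joined into strings.

    Assumes an nltk.Tree-like object: `tree.label()`, list-style iteration
    over children, and a constructor called as `type(tree)(label, children)`.
    """
    children = []
    for child in tree:
        if hasattr(child, "label"):
            # Subtree: rebuild it recursively.
            children.append(join_tagged_leaves(child, sep))
        elif isinstance(child, tuple):
            # (word, tag) leaf: join the parts, converting each to str first.
            children.append(sep.join(str(part) for part in child))
        else:
            # Any other leaf is kept unchanged.
            children.append(child)
    return type(tree)(tree.label(), children)
```

With the result from the code above, join_tagged_leaves(result).pretty_print() should then draw the tree, since every leaf is a plain string by the time it is formatted.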