Can't find ghostscript in NLTK?

I was playing around with NLTK when I tried to use the chunk module:

import nltk as nk
Sentence  = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
tokens = nk.word_tokenize(Sentence)
tagged = nk.pos_tag(tokens)
entities = nk.chunk.ne_chunk(tagged) 

The code runs fine, but when I type

>>> entities

I get the following error message:

Out[2]: Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
Traceback (most recent call last):

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\formatters.py", line 343, in __call__
return method()

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\tree.py", line 726, in _repr_png_
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 602, in find_binary
binary_names, url, verbose))

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 596, in find_binary_iter
url, verbose):

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 567, in find_file_iter
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))

LookupError: 

===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================

According to this post, the solution is to install Ghostscript, since the chunker is trying to use it to display a parse tree, and is looking for one of 3 binaries:

file_names=['gs', 'gswin32c.exe', 'gswin64c.exe']

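to use. But even though I installed ghostscript and can now find the binary with Windows search, I still get the same error.

What do I need to fix or update?


Additional path information: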

import os; print(os.environ['PATH'])

Returns:

C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Program Files (x86)\Parallels\Parallels Tools\Applications;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Oracle\RPAS14.1\RpasServer\bin;C:\Oracle\RPAS14.1\RpasServer\applib;C:\Program Files (x86)\Java\jre7\bin;C:\Program Files (x86)\Java\jre7\bin\client;C:\Program Files (x86)\Java\jre7\lib;C:\Program Files (x86)\Java\jre7\jre\bin\client;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;  

In short:

Instead of >>> entities, do this:

>>> print(repr(entities))

Or:

>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so "\b" is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities

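In long:

The problem lies in trying to print the output of ne_chunk, which is an nltk.tree.Tree object. Displaying it fires ghostscript to get both the string and the drawing representation of the NE-tagged sentence. Ghostscript is required so that the tree widget can be rendered for visualization.

Let's go through this step by step.

First, when you use ne_chunk, you can import it directly at the top level:

from nltk import ne_chunk

And it's recommended to use namespaces for your imports, i.e.:

from nltk import word_tokenize, pos_tag, ne_chunk

And when you use ne_chunk, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py

It's unclear which function the pickle loads, but after some inspection we find that there's only one built-in NE chunker that isn't rule-based. Since the name of the pickle binary mentions maxent, we can assume it's a statistical chunker, so it most probably comes from the NEChunkParser object in https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py . There are ACE data API functions there as well, which would explain the name of the pickle binary.

Now, whenever you use the ne_chunk function, it's actually calling the NEChunkParser.parse() function, which returns an nltk.tree.Tree object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118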

class NEChunkParser(ChunkParserI):
    """
    Expected input: list of pos-tagged words
    """
    def __init__(self, train):
        self._train(train)

    def parse(self, tokens):
        """
        Each token should be a pos-tagged word
        """
        tagged = self._tagger.tag(tokens)
        tree = self._tagged_to_parse(tagged)
        return tree

    def _train(self, corpus):
        # Convert to tagged sequence
        corpus = [self._parse_to_tagged(s) for s in corpus]

        self._tagger = NEChunkParserTagger(train=corpus)

    def _tagged_to_parse(self, tagged_tokens):
        """
        Convert a list of tagged tokens to a chunk-parse tree.
        """
        sent = Tree('S', [])

        for (tok,tag) in tagged_tokens:
            if tag == 'O':
                sent.append(tok)
            elif tag.startswith('B-'):
                sent.append(Tree(tag[2:], [tok]))
            elif tag.startswith('I-'):
                if (sent and isinstance(sent[-1], Tree) and
                    sent[-1].label() == tag[2:]):
                    sent[-1].append(tok)
                else:
                    sent.append(Tree(tag[2:], [tok]))
        return sent
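
As a rough illustration of what _tagged_to_parse does with the B-/I-/O tags, here is a minimal sketch that avoids the NLTK dependency. The Chunk namedtuple and the bio_to_chunks name are made up for this example (NLTK builds Tree objects instead), and the tags in the sample input are hypothetical, not actual chunker output.

```python
from collections import namedtuple

# Hypothetical stand-in for nltk.tree.Tree: a label plus the tokens it covers.
Chunk = namedtuple('Chunk', ['label', 'tokens'])

def bio_to_chunks(tagged_tokens):
    """Group (token, BIO-tag) pairs into a flat list of tokens and chunks."""
    sent = []
    for tok, tag in tagged_tokens:
        if tag == 'O':
            # Outside any entity: keep the bare token.
            sent.append(tok)
        elif tag.startswith('B-'):
            # Beginning of an entity: open a new chunk.
            sent.append(Chunk(tag[2:], [tok]))
        elif tag.startswith('I-'):
            last = sent[-1] if sent else None
            if isinstance(last, Chunk) and last.label == tag[2:]:
                # Continuation of the chunk that is currently open.
                last.tokens.append(tok)
            else:
                # Stray I- tag: treat it as the start of a new chunk.
                sent.append(Chunk(tag[2:], [tok]))
    return sent

# Hypothetical tags, chosen only to exercise the B-/I-/O branches.
example = [('Betty', 'B-PERSON'), ('Botter', 'I-PERSON'),
           ('bought', 'O'), ('butter', 'O')]
print(bio_to_chunks(example))
```

The point is only that the chunker returns a tree-like object, and it is the display of that object, not the chunking itself, that later involves ghostscript.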

If we take a look at the nltk.tree.Tree object, that's where the ghostscript problem appears, when it tries to call the _repr_png_ function: https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702

def _repr_png_(self):
    """
    Draws and outputs in PNG for ipython.
    PNG is used instead of PDF, since it can be displayed in the qt console and
    has wider browser support.
    """
    import os
    import base64
    import subprocess
    import tempfile
    from nltk.draw.tree import tree_to_treesegment
    from nltk.draw.util import CanvasFrame
    from nltk.internals import find_binary
    _canvas_frame = CanvasFrame()
    widget = tree_to_treesegment(_canvas_frame.canvas(), self)
    _canvas_frame.add_widget(widget)
    x, y, w, h = widget.bbox()
    # print_to_file uses scrollregion to set the width and height of the pdf.
    _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
    with tempfile.NamedTemporaryFile() as file:
        in_path = '{0:}.ps'.format(file.name)
        out_path = '{0:}.png'.format(file.name)
        _canvas_frame.print_to_file(in_path)
        _canvas_frame.destroy_widget(widget)
        subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                        '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                        .format(out_path, in_path).split())
        with open(out_path, 'rb') as sr:
            res = sr.read()
        os.remove(in_path)
        os.remove(out_path)
        return base64.b64encode(res).decode()

But note that it's strange for the Python interpreter to fire _repr_png_ instead of __repr__ when you type >>> entities at the prompt (see Purpose of Python's __repr__). That can't be how the native CPython interpreter works when it prints the representation of an object, so we take a look at IPython.core.formatters and see that it allows _repr_png_ to be fired at https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L725 :

class PNGFormatter(BaseFormatter):
    """A PNG formatter.
    To define the callables that compute the PNG representation of your
    objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
    or :meth:`for_type_by_name` methods to register functions that handle
    this.
    The return value of this formatter should be raw PNG data, *not*
    base64 encoded.
    """
    format_type = Unicode('image/png')

    print_method = ObjectName('_repr_png_')

    _return_type = (bytes, unicode_type)

We see that when IPython initializes a DisplayFormatter object, it tries to activate all formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66

def _formatters_default(self):
    """Activate the default formatters."""
    formatter_classes = [
        PlainTextFormatter,
        HTMLFormatter,
        MarkdownFormatter,
        SVGFormatter,
        PNGFormatter,
        PDFFormatter,
        JPEGFormatter,
        LatexFormatter,
        JSONFormatter,
        JavascriptFormatter
    ]
    d = {}
    for cls in formatter_classes:
        f = cls(parent=self)
        d[f.format_type] = f
    return d
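
To make the consequence concrete, here is a toy sketch of name-based dispatch, assuming a much simplified model of the formatters above. FakeTree, PRINT_METHODS and format_all are invented for this illustration and are not IPython APIs. Collecting errors without stopping is likewise an assumption, based on the question's output, where the plain-text Out[2] appears and the traceback follows.

```python
class FakeTree(object):
    """Stand-in for an object with both a text and a PNG representation."""

    def __repr__(self):
        return "FakeTree()"

    def _repr_png_(self):
        # Mimics a renderer whose external dependency (gs) is missing.
        raise LookupError("unable to find the gs file")


# Hypothetical MIME type -> method name table, loosely modelled on print_method.
PRINT_METHODS = {'text/plain': '__repr__', 'image/png': '_repr_png_'}


def format_all(obj):
    """Call every known representation method the object defines."""
    results, errors = {}, {}
    for mime, name in sorted(PRINT_METHODS.items()):
        method = getattr(obj, name, None)
        if method is None:
            continue
        try:
            results[mime] = method()
        except Exception as exc:
            # One failing formatter does not prevent the others from running.
            errors[mime] = exc
    return results, errors


results, errors = format_all(FakeTree())
print(results)         # {'text/plain': 'FakeTree()'}
print(sorted(errors))  # ['image/png']
print(repr(FakeTree()))  # a plain repr() never touches _repr_png_
```

In this model the object is asked for every representation it offers, which is why a missing gs binary surfaces even though the text representation is fine.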

Note that outside of IPython, in the native CPython interpreter, it will only call __repr__ and not _repr_png_:

>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sentence  = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
>>> entities
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])

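So now the solution:

Solution 1:

When printing out the string output of ne_chunk, you can use

>>> print(repr(entities))

instead of >>> entities. That way, IPython should explicitly call only __repr__ instead of calling all possible formatters.

Solution 2:

If you really need to use _repr_png_ to visualize the Tree object, then we need to figure out how to make the ghostscript binary visible to the environment variables NLTK looks at.

In your case, it seems that the default nltk.internals is unable to find the binary. More specifically, we're referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599

If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it's trying to look for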

env_vars=['PATH']

When NLTK tries to initialize its environment variables, it looks at os.environ; see https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L495

Note that find_binary calls find_binary_iter, which calls find_file_iter, which tries to look up the env_vars by reading os.environ.
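
The lookup can be pictured with a simplified sketch. This is not NLTK's actual implementation (the real find_file_iter handles more cases); find_on_path is a hypothetical helper that only shows the idea of trying each candidate name in each PATH directory.

```python
import os

def find_on_path(names, path=None):
    """Return the first existing file among `names` in the PATH directories, else None."""
    if path is None:
        path = os.environ.get('PATH', '')
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries, e.g. from a trailing separator
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None

# The candidate names are the ones listed in the question.
print(find_on_path(['gs', 'gswin32c.exe', 'gswin64c.exe']))
```

If this prints None, the directory containing the ghostscript binary is not among the PATH entries that the running Python process sees.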

So if we add to the path:

>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so "\b" is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs

Now this should work in IPython:

>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so "\b" is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities

Download gs.exe from "https://www.ghostscript.com/download/gsdnld.html" and add its path to the environment variables.

The path is probably stored under

C:\Program Files\

(in my system it looks like "C:\Program Files\gs\gs9.21\bin")

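and add it to the environment variables:

control panel->system and security->system->advanced system settings->Environment Variables->(in system variables scroll down and double click on path)->

then add the copied path

(in my case "C:\Program Files\gs\gs9.21\bin")

P.S.: Don't forget to add a semicolon (;) before pasting the path. Append it to the existing value instead of deleting what is there and replacing it with the new path, or you may run into trouble and need to restore a backup :)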
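
After changing the system PATH, it can help to confirm from a freshly started Python session that the binary is visible, since already running shells and IPython kernels keep their old environment. A small check, assuming Python 3.3+ for shutil.which (locate_ghostscript is a made-up helper name):

```python
import shutil

# Candidate binary names taken from the question.
GS_NAMES = ['gs', 'gswin32c.exe', 'gswin64c.exe']

def locate_ghostscript(path=None):
    """Return the full path of the first ghostscript binary found, else None."""
    for name in GS_NAMES:
        found = shutil.which(name, path=path)
        if found:
            return found
    return None

print(locate_ghostscript())
```

A result of None after editing the environment variables usually means the session was started before the change, or the directory added does not contain one of these names.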

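In my case, when I added the path using an ordinary string literal with backslashes (path_to_gs = "C:\Program Files\gs\gs9.27\bin"), the result was: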

'C:\Program Files\gs\gs9.27\x08in'

That is incorrect, so I changed it to path_to_gs = 'C:/Program Files/gs/gs9.27/bin' and it worked.
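
The '\x08' in that result is the backspace character: in an ordinary Python string literal, "\b" is an escape sequence. A tiny check, using a shortened, made-up path fragment rather than the real install path, shows the difference between a plain literal, a raw string, and forward slashes:

```python
plain = "gs\bin"     # "\b" is the backspace escape, so this holds 5 characters
raw = r"gs\bin"      # a raw string keeps the backslash: 6 characters
forward = "gs/bin"   # forward slashes carry no escape meaning

print(repr(plain))    # 'gs\x08in'
print(repr(raw))      # 'gs\\bin'
print(repr(forward))  # 'gs/bin'
```

Either the raw string or the forward-slash form avoids the problem.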

Adding to @predictorx's comment. What worked for me was

    path_to_gs = r"C:\Program Files\gs\gs9.53.3\bin"  # raw string, so "\b" is not read as a backspace
    os.environ['PATH'] += os.pathsep + path_to_gs
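
PATH entries need a separator between them, and running the same cell repeatedly keeps growing the variable. A minimal sketch of a helper that handles both, where append_to_path is a hypothetical name and the comparison is a plain, case-sensitive string match:

```python
import os

def append_to_path(directory, environ=None):
    """Append `directory` to PATH with the right separator, skipping duplicates."""
    if environ is None:
        environ = os.environ
    current = environ.get('PATH', '')
    if directory in current.split(os.pathsep):
        return current  # already present, nothing to do
    if current and not current.endswith(os.pathsep):
        current += os.pathsep
    environ['PATH'] = current + directory
    return environ['PATH']

# Demonstrated on a plain dict so the real environment is left untouched.
demo_env = {'PATH': os.pathsep.join(['first', 'second'])}
print(append_to_path('third', demo_env))
```

Called with only the directory argument, it modifies os.environ for the current process; the change does not persist after the interpreter exits.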