How to solve the UnicodeDecodeError when using the Stanford Parser API in NLTK for Python?

Question

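I want to use the Stanford Parser from Python. I am on Windows 7 with Python 2.7 and NLTK 3.0 installed, and I downloaded the Stanford Parser from the official website.

After fixing the JAVAHOME environment problem, I got this error message: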

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

I could not find a solution to this problem.

This is the code I used:

# -*- coding: utf-8 -*-

from nltk.parse import stanford

parser = stanford.StanfordParser(model_path='C:\Program Files (x86)\stanford-parser-full-2015-01-30\edu\stanford\nlp\models\lexparser\englishPCFG.ser.gz')

sent = 'my name is zim'
parser.parse(sent)

I searched Stack Overflow for a solution but did not find one.

Answer 1

0xe9 is not a valid ASCII byte, so your englishPCFG.ser.gz cannot be ASCII-encoded. You need to figure out which encoding it uses (probably UTF-8) and tell StanfordParser() by passing the encoding keyword argument.
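The error message can be reproduced in isolation, which may help when tracking down where the offending bytes come from. The following is only a minimal sketch: the helper name try_decode is made up for illustration and is not part of NLTK. It shows that a byte 0xe9 cannot be decoded as ASCII, while an explicit encoding such as Latin-1 maps it to "é". Note that a lone 0xe9 is not valid UTF-8 either, so the right encoding depends on the source of the bytes.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def try_decode(raw, encoding):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


raw = b'\xe9'  # the byte reported in the traceback

# ASCII only covers bytes 0x00-0x7f, so this fails.
text, error = try_decode(raw, 'ascii')
print(error)

# Latin-1 assigns a character to every byte value, so this succeeds.
latin_text, latin_error = try_decode(raw, 'latin-1')
print(repr(latin_text))
```

If the same message appears when calling the parser, some byte string (the input sentence, a path, or the parser output) is being decoded with the default ASCII codec somewhere along the way.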

Answer 2

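If the os.environ or export paths are set correctly, as described in "Stanford Parser and NLTK", then the problem should lie in:

specifying the encoding in the NLTK API, and
the encoding of the input string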

So the solution is:

Update NLTK to the latest stable version, i.e. sudo pip install -U nltk
Use Python 3, or specify the encoding of your string

If you cannot update Python or NLTK, then:

Specify the encoding when using the Stanford API in NLTK (because of https://github.com/nltk/nltk/issues/877)
Specify the encoding of your string
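"Specifying the encoding of your string" can be sketched as follows. This is an illustration under assumptions, not NLTK code: to_text is a hypothetical helper, and UTF-8 is only assumed as the default because it is the most common case. It turns byte strings into text explicitly instead of relying on the implicit ASCII decoding that Python 2 performs.

```python
# -*- coding: utf-8 -*-
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value, encoding='utf-8'):
    """Return `value` as text, decoding byte strings with `encoding`."""
    if isinstance(value, text_type):
        return value  # already text, nothing to decode
    return value.decode(encoding)


sent = to_text(b'my name is zim')
accented = to_text(b'caf\xc3\xa9')  # UTF-8 bytes for "café"
```

The resulting text object can then be passed to the parser, so that no step has to guess the encoding of the input.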

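Using Python 3 is strongly recommended, especially when processing text input.

If all else fails, you only have an old version of NLTK, and you have to use Python 2.7 for some reason, then: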

import six
from nltk.parse import stanford

# Raw string, so that the \n in \nlp is not read as a newline character
path_to_model = r"C:\Program Files (x86)\stanford-parser-full-2015-01-30\edu\stanford\nlp\models\lexparser\englishPCFG.ser.gz"

parser = stanford.StanfordParser(model_path=path_to_model, encoding='utf8')

sent = six.text_type('my name is zim')
parser.parse(sent)

See the six documentation @ http://pythonhosted.org//six/#six.text_type

Answer 3

I have found the cause of the error I was getting:

raise OSError('Java command failed : ' + str(cmd)) OSError: Java command failed :...

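This error is caused by the path being misinterpreted in the following statement:

parser = stanford.StanfordParser(model_path='C:\Program Files (x86)\stanford-parser-full-2015-01-30\edu\stanford\nlp\models\lexparser\englishPCFG.ser.gz')

The \n in ...\nlp\... is interpreted as a newline escape (by Python, since the literal is not a raw string), so the path is read as ...<newline>lp\... and cannot be found.

I tried a simple fix: I renamed the nlp folder. It worked!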
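Renaming the folder is not the only way around this. A short sketch of the alternatives, using a shortened, hypothetical path fragment in which only the \n of \nlp matters: a raw string, a doubled backslash, or ntpath.join all keep the backslash literal.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import ntpath  # Windows path rules, importable on any platform

# Shortened, hypothetical path fragment: only the \n in \nlp matters here.
plain = 'stanford\nlp'      # "\n" becomes a single newline character
raw = r'stanford\nlp'       # raw string: backslash and "n" stay separate
escaped = 'stanford\\nlp'   # doubled backslash: same result as the raw string
joined = ntpath.join('stanford', 'nlp')  # lets the library insert the separator

print(repr(plain))
print(repr(raw))
```

With any of the last three forms the model path reaches Java unchanged, so the folder can keep its original name.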

Tags: python, unicode, character-encoding, nltk, stanford-nlp