Exception error while installing NLTK in Ubuntu 14.04 and Python 2.7

I get the following error when I try to install NLTK by running this command:

sudo pip install -U nltk

Exception:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pip-6.0.8-py2.7.egg/pip/basecommand.py", line 232, in main
    status = self.run(options, args)
  File "/usr/local/lib/python2.7/dist-packages/pip-6.0.8-py2.7.egg/pip/commands/install.py", line 305, in run
    name, None, isolated=options.isolated_mode,
  File "/usr/local/lib/python2.7/dist-packages/pip-6.0.8-py2.7.egg/pip/req/req_install.py", line 181, in from_line
    isolated=isolated)
  File "/usr/local/lib/python2.7/dist-packages/pip-6.0.8-py2.7.egg/pip/req/req_install.py", line 54, in __init__
    req = pkg_resources.Requirement.parse(req)
  File "/usr/local/lib/python2.7/dist-packages/pip-6.0.8-py2.7.egg/pip/_vendor/pkg_resources/__init__.py", line 2873, in parse
    reqs = list(parse_requirements(s))
  File "/usr/local/lib/python2.7/dist-packages/pip-6.0.8-py2.7.egg/pip/_vendor/pkg_resources/__init__.py", line 2807, in parse_requirements
    raise ValueError("Missing distribution spec", line)
ValueError: ('Missing distribution spec', '\xe2\x80\x90U')

Most likely, you cut and pasted this line from a web page (or a PDF). That page may have changed the ASCII hyphen (minus sign) into a U+2010 HYPHEN.

So you need to type the command yourself as

sudo pip install -U nltk   # This uses the usual ASCII hyphen

instead of cutting and pasting:

sudo pip install ‐U nltk   # This uses U+2010 HYPHEN
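Because the two commands look identical on screen, it can help to check a pasted line for non-ASCII characters before running it. The following is a minimal sketch, assuming Python 3; `find_non_ascii` is a hypothetical helper name, not part of pip or NLTK.

```python
import unicodedata


def find_non_ascii(command):
    """Return (index, character, code point, Unicode name) for each non-ASCII character.

    Hypothetical helper for illustration only.
    """
    return [
        (i, ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(command)
        if ord(ch) > 127
    ]


pasted = "sudo pip install \u2010U nltk"  # U+2010 HYPHEN, as in the pasted command
typed = "sudo pip install -U nltk"        # usual ASCII hyphen

print(find_non_ascii(pasted))  # reports the U+2010 HYPHEN and its position
print(find_non_ascii(typed))   # empty list: nothing suspicious
```

An empty result means every character in the command is plain ASCII.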

The error message

raise ValueError("Missing distribution spec", line)
ValueError: ('Missing distribution spec', '\xe2\x80\x90U')

means that line is the string '\xe2\x80\x90U'.

This UTF-8 encoded string looks like a hyphen followed by a U:

In [90]: print('\xe2\x80\x90U')
‐U    

But note that this hyphen is not the usual ASCII hyphen (chr(45)). Instead, it is the UTF-8 encoded U+2010 HYPHEN:

In [93]: import unicodedata as UD

In [95]: UD.name('\xe2\x80\x90'.decode('utf-8'))
Out[95]: 'HYPHEN'

In [102]: hex(ord('\xe2\x80\x90'.decode('utf-8')))
Out[102]: '0x2010'
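The sessions in this answer are Python 2, where a plain string literal is a byte string with a .decode method. Under Python 3 the same check needs a bytes literal. A minimal sketch, assuming Python 3:

```python
import unicodedata

raw = b"\xe2\x80\x90U"      # the bytes pip reported in the ValueError
text = raw.decode("utf-8")  # decode the UTF-8 bytes into a str
first = text[0]             # the character that looks like a hyphen

print(unicodedata.name(first))  # Unicode name of the character
print(hex(ord(first)))          # its code point
print(first == "-")             # False: it is not the ASCII hyphen
```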

In contrast, the usual hyphen chr(45) is the (UTF-8 encoded) U+002D HYPHEN-MINUS:

In [97]: UD.name('-'.decode('utf-8'))
Out[97]: 'HYPHEN-MINUS'

In [4]: hex(ord('-'.decode('utf-8')))
Out[4]: '0x2d'

In [98]: ord('-')
Out[98]: 45
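If retyping the command is not practical, the look-alike characters could instead be replaced programmatically. This is only a sketch, assuming Python 3; `to_ascii_hyphen` and `DASH_LOOKALIKES` are hypothetical names, and the list of look-alikes is an assumption that is not exhaustive.

```python
# Characters commonly substituted for the ASCII hyphen by web pages and PDFs.
# This list is an assumption for illustration, not a complete inventory.
DASH_LOOKALIKES = (
    "\u2010"  # HYPHEN
    "\u2011"  # NON-BREAKING HYPHEN
    "\u2012"  # FIGURE DASH
    "\u2013"  # EN DASH
    "\u2014"  # EM DASH
    "\u2212"  # MINUS SIGN
)

# Translation table mapping each look-alike to U+002D HYPHEN-MINUS.
_TABLE = {ord(ch): "-" for ch in DASH_LOOKALIKES}


def to_ascii_hyphen(command):
    """Replace dash look-alikes with the ASCII hyphen (hypothetical helper)."""
    return command.translate(_TABLE)


pasted = "sudo pip install \u2010U nltk"
print(to_ascii_hyphen(pasted))
```

Blindly rewriting every dash is only appropriate for command lines; in ordinary prose an en dash or em dash may be intentional.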