Why am I getting this error from invalid Unicode in my source (but only when importing matplotlib)?

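It took me a while to track down what triggers this problem, and I still don't understand what is going on. I recently switched to Python 3 and got this huge error when trying to import matplotlib: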

Traceback (most recent call last):
  File "C:/Users/y2kbugger/Desktop/test.py", line 6, in <module>
  File "C:\Anaconda2\envs\mypackage\lib\site-packages\matplotlib\__init__.py", line 124, in <module>
    from matplotlib.rcsetup import (defaultParams,
  File "C:\Anaconda2\envs\mypackage\lib\site-packages\matplotlib\rcsetup.py", line 30, in <module>
    from matplotlib.fontconfig_pattern import parse_fontconfig_pattern
  File "C:\Anaconda2\envs\mypackage\lib\site-packages\matplotlib\fontconfig_pattern.py", line 25, in <module>
    from pyparsing import Literal, ZeroOrMore, \
  File "C:\Anaconda2\envs\mypackage\lib\site-packages\pyparsing.py", line 3539, in <module>
    _escapedPunc = Word( _bslash, r"\[]-*.$+^?()~ ", exact=2 ).setParseAction(lambda s,l,t:t[0][1])
  File "C:\Anaconda2\envs\mypackage\lib\site-packages\pyparsing.py", line 966, in setParseAction
    self.parseAction = list(map(_trim_arity, list(fns)))
  File "C:\Anaconda2\envs\mypackage\lib\site-packages\pyparsing.py", line 813, in _trim_arity
    this_line = extract_stack()[-1]
  File "C:\Anaconda2\envs\mypackage\lib\site-packages\pyparsing.py", line 797, in extract_stack
    frame_summary = traceback.extract_stack()[offset]
  File "C:\Anaconda2\envs\mypackage\lib\traceback.py", line 207, in extract_stack
    stack = StackSummary.extract(walk_stack(f), limit=limit)
  File "C:\Anaconda2\envs\mypackage\lib\traceback.py", line 358, in extract
    f.line
  File "C:\Anaconda2\envs\mypackage\lib\traceback.py", line 282, in line
    self._line = linecache.getline(self.filename, self.lineno).strip()
  File "C:\Anaconda2\envs\mypackage\lib\linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
  File "C:\Anaconda2\envs\mypackage\lib\linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
  File "C:\Anaconda2\envs\mypackage\lib\linecache.py", line 137, in updatecache
    lines = fp.readlines()
  File "C:\Anaconda2\envs\mypackage\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 308: invalid start byte

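Commenting out import matplotlib as mpl makes the error go away. That sent me down the wrong path, trying different combinations of matplotlib, numpy, and so on. The part that confuses me is that deleting a comment (one I pasted from the web) actually fixes the error. My editor is vim. My guess is that UTF-8 is not the encoding vim is using to write the file.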

Minimal example that reproduces the error:

# -*- coding: utf-8 -*-
import matplotlib as mpl
# Bad character pasted into vim from chrome: –

To fix it, just delete the "EN DASH" (or all of line 3) and matplotlib imports correctly.

So why does invalid (?) Unicode in a comment cause an error only when I try to import matplotlib (and before execution even reaches the comment in question)?

python==3.5.2

colorama==0.3.7
comtypes==1.1.2
cycler==0.10.0
matplotlib==1.5.1
numpy==1.11.1
pandas==0.18.1
py==1.4.31
pyparsing==2.1.4
pytest==2.9.2
python-dateutil==2.5.3
pytz==2016.6.1
pywin32==220
scikit-learn==0.17.1
scipy==0.18.0
six==1.10.0

The problem is in pyparsing:

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. With pyparsing, you don't need to learn a new syntax for defining grammars or matching expressions - the parsing module provides a library of classes that you use to construct the grammar directly in Python.

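In order to "construct the grammar directly in Python", pyparsing needs to read the source file where the grammar is defined (in this case, a matplotlib source file). In what would normally be just a bit of harmless extra work, pyparsing reads not only the matplotlib source file but everything on the stack at the point where the grammar is defined, all the way up to the file where you have import matplotlib. When it reaches your source file it chokes, because your file really is not UTF-8: 0x96 is the Windows-1252 encoding of the en dash. (Strict Latin-1 maps that byte to a control character, although the two names are often used interchangeably.) This problem (reading too much of the stack) has already been fixed by the author of pyparsing, so the fix should be in the next release of pyparsing (probably 2.1.8).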
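The decoding failure can be reproduced without matplotlib or pyparsing at all. The following is a minimal sketch, assuming the file was saved as Windows-1252 while its coding cookie claims UTF-8. It uses tokenize.open(), which is what Python 3's linecache (the last standard-library module in the traceback before codecs) uses to read source files. The temporary file and the variable names are purely illustrative.

```python
import os
import tempfile
import tokenize

# The minimal example from the question, saved the way the editor apparently
# saved it: as Windows-1252, where the en dash is the single byte 0x96.
source = (
    "# -*- coding: utf-8 -*-\n"
    "import matplotlib as mpl\n"
    "# Bad character pasted into vim from chrome: \u2013\n"
)
raw = source.encode("cp1252")

# Write the bytes to a throwaway file (hypothetical stand-in for test.py).
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as fh:
    fh.write(raw)

try:
    # tokenize.open() trusts the coding cookie (utf-8 here) and then decodes
    # the whole file with it when the lines are read.
    with tokenize.open(path) as fp:
        fp.readlines()
    error = None
except UnicodeDecodeError as exc:
    error = exc
finally:
    os.remove(path)

print(error)
# Decoding the same bytes as Windows-1252 recovers the comment intact.
print(repr(raw.decode("cp1252").splitlines()[-1]))
```

Nothing in this sketch imports matplotlib: the import line is only text inside the file being decoded. The interpreter itself never complained in the question's setup; the error surfaced only when something asked for the source text of the calling file.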

By the way, matplotlib defines a pyparsing grammar so that it can read fontconfig patterns, which are a way of configuring fonts used mostly on Linux. So on Windows, pyparsing is probably not even necessary to use matplotlib!
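Back to the encoding itself: if you would rather keep the en dash than delete it, another option is to re-save the file so that its bytes match the utf-8 coding cookie. The helper below is a hypothetical sketch, not something from pyparsing or matplotlib, and it assumes the file is currently Windows-1252 (which is what the 0x96 byte suggests). Back up the file before trying it.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite `path` as UTF-8, assuming it is currently `source_encoding`.

    Returns True if the file was rewritten, False if it was already valid
    UTF-8 and therefore left untouched.
    """
    file = Path(path)
    raw = file.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already decodes as UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    # Work on bytes so that line endings are preserved exactly.
    file.write_bytes(raw.decode(source_encoding).encode("utf-8"))
    return True


# Example call (path taken from the traceback in the question):
# reencode_to_utf8("C:/Users/y2kbugger/Desktop/test.py")
```

Inside vim, `:set fileencoding=utf-8` followed by `:w` should have the same effect for a buffer whose characters display correctly.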