UnicodeEncodeError if piping output to wc -l

Question

Running this code:

#! /usr/bin/env python
# -*- coding: UTF-8 -*-

import xml.etree.ElementTree as ET
print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text

produces the expected output, vägen, but if I pipe it to wc -l I get a UnicodeEncodeError. For example (ETerr.py contains the snippet given above):

:~> ETerr.py | wc -l
Traceback (most recent call last):
  File "./ETerr.py", line 5, in <module>
    print ET.fromstring('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><road>vägen</road></root>').find('road').text 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
0
:~>

Why does the code behave differently depending on whether its output is piped? How can I fix it so that it behaves the same in both cases?

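Note that the snippet above is only meant to demonstrate the problem with as little code as possible. In the actual script where I need to solve this, the XML is retrieved using [...], so I have no control over its format.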

Answer 1

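First, let me point out that this is not a problem in Python 3; fixing it is in fact one of the reasons a compatibility-breaking change to the language was worth making in the first place. But I'll assume you have a good reason for using Python 2 and can't just upgrade.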

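The proximate cause here (assuming you're on Python 2.7 on a POSIX platform; things can be more complicated on older 2.x versions and on [...]) is the value of sys.stdout.encoding. When you start the interpreter, it does the equivalent of this pseudocode: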

# Pseudocode: isatty() and parse_locale() stand in for the interpreter's own checks.
if isatty(stdoutfd):
    sys.stdout.encoding = parse_locale(os.environ['LC_CTYPE'])
else:
    sys.stdout.encoding = None

And every time you write to a file, including sys.stdout, and including implicitly via the print statement, it does something like this:

if isinstance(s, unicode):
    if self.encoding:
        s = s.encode(self.encoding)
    else:
        s = s.encode(sys.getdefaultencoding())

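The actual code does the standard POSIX thing of looking for fallbacks such as LANG, and in some cases hard-codes a fallback to UTF-8 (on Mac OS X, for example), but this is close enough.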
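To make the failure concrete, here is a small runnable imitation of that write path. It is only a sketch: encode_for_stream is a made-up name, and the 'ascii' fallback is hard-coded on the assumption that sys.getdefaultencoding() returns 'ascii', as it does on a stock Python 2.

```python
def encode_for_stream(text, stream_encoding):
    """Imitate how Python 2 turns a unicode string into bytes on write.

    stream_encoding plays the role of sys.stdout.encoding. None means
    "nothing was detected", so the default encoding is used instead.
    """
    fallback = 'ascii'  # assumed value of sys.getdefaultencoding() on Python 2
    return text.encode(stream_encoding or fallback)


word = u'v\xe4gen'  # the word from the question, a-umlaut written as an escape

# Terminal case: an encoding was detected from the locale, so encoding works.
print(repr(encode_for_stream(word, 'UTF-8')))

# Pipe case: no encoding was detected, ASCII is used, and the umlaut fails.
try:
    encode_for_stream(word, None)
except UnicodeEncodeError as exc:
    print('pipe case failed: %s' % exc)
```

The second call fails at position 1 of the string, the same position the traceback in the question reports.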

This is only sparsely documented, under file.encoding:

The encoding that this file uses. When Unicode strings are written to a file, they will be converted to byte strings using this encoding. In addition, when the file is connected to a terminal, the attribute gives the encoding that the terminal is likely to use (that information might be incorrect if the user has misconfigured the terminal). The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings.

To verify that this is your problem, try this:

$ python -c 'print __import__("sys").stdout.encoding'
UTF-8
$ python -c 'print __import__("sys").stdout.encoding' | cat
None

To be extra sure that this is the problem:

$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding'
Latin-1
$ PYTHONIOENCODING=Latin-1 python -c 'print __import__("sys").stdout.encoding' | cat
Latin-1
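If you would rather script that check than type it, something like the following might do. It is a sketch under assumptions: child_stdout_encoding is a hypothetical helper, and it relies on subprocess.check_output giving the child a pipe for stdout, which puts it in the same situation as `| cat`.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Start a child interpreter with a piped stdout and return what it
    reports for sys.stdout.encoding, as text.

    io_encoding, if given, is passed to the child as PYTHONIOENCODING.
    """
    env = dict(os.environ)
    env.pop('PYTHONIOENCODING', None)  # start from a clean slate
    if io_encoding is not None:
        env['PYTHONIOENCODING'] = io_encoding
    out = subprocess.check_output(
        [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
        env=env)
    return out.decode('ascii').strip()


if __name__ == '__main__':
    print('piped, no override: %s' % child_stdout_encoding())
    print('piped, with override: %s' % child_stdout_encoding('UTF-8'))
```

On Python 2.7 the first line should report None and the second the override; Python 3 reports a real encoding in both cases.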

So, how do you fix this?

Well, the obvious way is to upgrade to Python 3.6, where you'll get UTF-8 in both cases, but I'll assume you're using Python 2.7 for a reason and can't easily change that.

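The right solution is actually quite complicated. But if you want a quick and dirty fix that works on your system, and on most current Linux and Mac systems with a standard Python 2.7 setup (though not necessarily on older Linux systems, older Python 2.x versions, or unusual setups), you can do one of the following: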

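Set the environment variable PYTHONIOENCODING to override the detection and force UTF-8. If you know that every terminal and every tool you will ever use from this account is UTF-8, it may be worth setting this in your profile or similar; if that isn't true, it's a terrible idea.
Check sys.stdout.encoding and, if it is None, wrap sys.stdout in a 'UTF-8' encoder.
Explicitly .encode('UTF-8') everything you print.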
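The second and third options could look roughly like this. Treat it as a sketch: utf8_stdout and as_utf8 are names I made up, codecs.getwriter is the standard-library piece doing the work, and an io.BytesIO stands in for a piped stdout so the demonstration doesn't depend on your terminal.

```python
import codecs
import io
import sys


def utf8_stdout(stream=None):
    """Option 2: return a stream that accepts unicode text.

    If the stream has no detected encoding (the piped case on Python 2),
    wrap it in a UTF-8 writer; otherwise return it unchanged.
    """
    stream = sys.stdout if stream is None else stream
    if getattr(stream, 'encoding', None) is None:
        return codecs.getwriter('UTF-8')(stream)
    return stream


def as_utf8(text):
    """Option 3: encode explicitly before printing (Python 2 style)."""
    return text.encode('UTF-8')


# A byte stream with no .encoding attribute, standing in for a piped stdout.
pipe_like = io.BytesIO()
writer = utf8_stdout(pipe_like)
writer.write(u'v\xe4gen\n')
print(repr(pipe_like.getvalue()))
```

In a real Python 2 script you would do sys.stdout = utf8_stdout() once near the top. Be aware that after wrapping, printing a byte string containing non-ASCII bytes can fail in its own way, because the writer tries to decode it first, so it is best to print unicode strings only.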
