Python line.replace returns UnicodeEncodeError

Question

I have a tex file that was generated from source using Sphinx. It is encoded as UTF-8 without BOM (according to Notepad++), is named final_report.tex, and has the following content:

% Generated by Sphinx.
\documentclass[letterpaper,11pt,english]{sphinxmanual}
\usepackage[utf8]{inputenc}
\begin{document}

\chapter{Preface}
Krimson4 is a nice programming language.
Some umlauts äöüßÅö.
That is an “double quotation mark” problem.
Johnny’s apostrophe allows connecting multiple ports.
Components that include data that describe how they ellipsis …
Software interoperability – some dash – is not ok.
\end{document}

Now, before compiling the tex source to PDF, I want to replace some lines in the tex file to get nicer results. My script was inspired by another SO question.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os

newFil=os.path.join("build", "latex", "final_report.tex-new")
oldFil=os.path.join("build", "latex", "final_report.tex")

def freplace(old, new):
    with open(newFil, "wt", encoding="utf-8") as fout:
        with open(oldFil, "rt", encoding="utf-8") as fin:
            for line in fin:
                print(line)
                fout.write(line.replace(old, new))
    os.remove(oldFil)
    os.rename(newFil, oldFil)

# Raw strings keep the backslashes literal and avoid invalid-escape warnings.
freplace(r'\documentclass[letterpaper,11pt,english]{sphinxmanual}', r'\documentclass[letterpaper, 11pt, english]{book}')

This works on Ubuntu 16.04 with Python 2.7 as well as Python 3.5, but it fails on Windows with Python 3.4. The error message I get is:

File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 11: character maps to <undefined>

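where 201c stands for the left double quotation mark. If I remove the offending character, the script runs on until it reaches the next problematic character.

In the end, I need a solution that works on Linux and Windows, and on Python 2.7 as well as 3.x. I have tried many of the solutions suggested here on SO, but have not yet found one that works for me...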

Answer 1

You need to specify the correct encoding with encoding="the_encoding":

with open(oldFil, "rt", encoding="utf-8") as fin, open(newFil, "wt", encoding="utf-8") as fout:
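Since the question asks for something that runs on both Python 2.7 and 3.x, here is a minimal sketch of the same replace-and-rename routine built on io.open, which accepts encoding= on both versions (the built-in open in Python 2.7 does not). The function signature, the path parameter and the "-new" temporary suffix are illustrative choices modelled on the question's script, not a fixed recipe.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os


def freplace(path, old, new, encoding="utf-8"):
    """Rewrite the file at `path`, replacing `old` with `new` on every line."""
    tmp_path = path + "-new"  # temporary name, mirroring the question's script
    # io.open takes encoding= on Python 2.7 and 3.x alike, so the platform's
    # preferred encoding is never consulted.
    with io.open(path, "rt", encoding=encoding) as fin, \
            io.open(tmp_path, "wt", encoding=encoding) as fout:
        for line in fin:
            # No print(line) here: a console with a limited code page may not
            # be able to encode every character in the file.
            fout.write(line.replace(old, new))
    os.remove(path)
    os.rename(tmp_path, path)


# Example call (raw strings keep the backslashes literal):
# freplace(os.path.join("build", "latex", "final_report.tex"),
#          r"\documentclass[letterpaper,11pt,english]{sphinxmanual}",
#          r"\documentclass[letterpaper, 11pt, english]{book}")
```

The file is read and written as text on both interpreters, so old and new should be text (unicode) strings as well; unicode_literals takes care of that for literals in this module.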

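If you do not specify an encoding, the platform's preferred encoding is used. From the documentation for open:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.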
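To see which encodings are actually in play on a given machine, a small diagnostic along these lines may help. It prints the preferred encoding that open falls back to, prints the encoding that print uses for the console, and checks whether a line from the question's file can be encoded in UTF-8 and in cp850, the codec named in the traceback. The helper can_encode is a hypothetical name introduced here. Reading the traceback as coming from the console (that is, from the print(line) call) rather than from the file write is an assumption, not something the question confirms.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import locale
import sys

# Encoding that open() falls back to in text mode when encoding= is omitted.
print("preferred encoding:", locale.getpreferredencoding(False))
# Encoding print() uses for standard output; it can be None when unset.
print("stdout encoding:", sys.stdout.encoding)


def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A line from the question's file, with U+201C and U+201D as escapes.
line = "That is an \u201cdouble quotation mark\u201d problem."
print(can_encode(line, "utf-8"))  # True
print(can_encode(line, "cp850"))  # False: cp850 has no curly quotation marks
```

If the stdout encoding reported on the failing Windows machine is cp850, removing the debug print (or encoding the text explicitly before printing) is worth trying alongside the explicit encoding= arguments.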
