â€™ instead of ' in Natural Reader after encoding with utf-8

Question

I have some text from the web. After processing it, I write it to a txt file with:

text_file = open("input.txt", "w")
text_file.write(finaltext.encode('utf-8'))
text_file.close()

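When I open the txt file, everything looks fine. But when I load it into Natural Reader to convert it to audio, I see â€™ instead of ', and only for some of the ' characters, not all of them.

What should I do?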

Answer 1

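If you open the file with a native text editor and it looks fine, the problem is probably with your other program, which is not detecting the encoding correctly and is mojibaking it up. As mentioned in comments, it's almost assuredly a Unicode quote character that looks like ' but isn't.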

my_string = ('The Knights who say '
    '\N{LEFT SINGLE QUOTATION MARK}'
    'Ni!'
    '\N{RIGHT SINGLE QUOTATION MARK}'
)
def print_repr_escaped(x):
    # ascii() is repr() with non-ASCII characters shown as \u escapes
    print(ascii(x))

print_repr_escaped(my_string)
# 'The Knights who say \u2018Ni!\u2019'
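To see where â€™ comes from, here is a minimal sketch. It assumes the other program decodes the file as Windows-1252 (the usual "ANSI" code page on Western Windows systems); that is a guess about Natural Reader, not something confirmed in the question. It also shows why only some apostrophes break: the plain ASCII ' survives the wrong decoding, while the typographic one does not.

```python
# U+2019 is the typographic apostrophe that "looks like ' but isn't".
right_quote = "\N{RIGHT SINGLE QUOTATION MARK}"

# UTF-8 stores it as three bytes.
utf8_bytes = right_quote.encode("utf-8")
print(utf8_bytes)
# b'\xe2\x80\x99'

# Assumption: the reading program interprets those bytes as Windows-1252,
# so each byte becomes its own character.
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))
# '\xe2\u20ac\u2122'  (these three characters display as the mojibake in the question)

# A plain ASCII apostrophe is a single byte in both encodings, so it is unaffected.
plain = "'".encode("utf-8").decode("cp1252")
print(plain == "'")
# True

# The damage is reversible as long as no bytes were dropped.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == right_quote)
# True
```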

If you have no control over the other program's encoding, you have 2 options:

Strip all non-ASCII characters, like so:

stripped = my_string.encode('ascii', 'ignore').decode('ascii')
print_repr_escaped(stripped)
# 'The Knights who say Ni!'

Try converting the Unicode characters to ASCII with something like Unidecode:

import unidecode

converted = unidecode.unidecode(my_string)
print_repr_escaped(converted)
# "The Knights who say 'Ni!'"
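If installing Unidecode is not an option and only the quote characters are causing trouble, a narrower variant of the same idea can be sketched with the standard library. The mapping table and the `normalize_quotes` name below are illustrative assumptions, not part of the answer above, and the table covers only the four most common typographic quotes.

```python
# Hypothetical mapping: typographic quotes -> their ASCII look-alikes.
QUOTE_MAP = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
}

def normalize_quotes(text):
    """Replace typographic quotes with ASCII ones; leave everything else alone."""
    return text.translate(QUOTE_MAP)

sample = "The Knights who say \u2018Ni!\u2019"
print(normalize_quotes(sample))
# The Knights who say 'Ni!'
```

Unlike the strip approach, this keeps the quotes, and unlike Unidecode, it leaves any other non-ASCII text untouched.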

Answer 2

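If you are on Windows, many Windows applications assume a file uses the native ANSI encoding unless it starts with a byte order mark (BOM). A BOM is not normally required for UTF-8, but on Windows it is used as a signature for UTF-8 files. You can write one with the utf-8-sig codec. The following works in both Python 2.x and 3.x: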

import io
with io.open("input.txt", "w", encoding='utf-8-sig') as text_file:
    text_file.write(finaltext)
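To check that the signature was actually written, you can look at the first three bytes of the file. The following is a small sketch: the helper name `starts_with_utf8_bom`, the sample text, and the temporary file names are made up for illustration.

```python
import codecs
import io
import os
import tempfile

def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 BOM bytes EF BB BF."""
    with io.open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

# Hypothetical sample text containing typographic quotes.
sample_text = u"The Knights who say \u2018Ni!\u2019"

tmp_dir = tempfile.mkdtemp()
with_bom = os.path.join(tmp_dir, "with_bom.txt")
without_bom = os.path.join(tmp_dir, "without_bom.txt")

# utf-8-sig prepends the BOM; plain utf-8 does not.
with io.open(with_bom, "w", encoding="utf-8-sig") as f:
    f.write(sample_text)
with io.open(without_bom, "w", encoding="utf-8") as f:
    f.write(sample_text)

print(starts_with_utf8_bom(with_bom))
# True
print(starts_with_utf8_bom(without_bom))
# False
```

When reading such a file back in Python, opening it with encoding='utf-8-sig' removes the BOM again, so it does not show up as a stray character at the start of the text.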


python

encode

utf-8