Apostrophes come out garbled when read from a .txt file

Question

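I'm having trouble reading lines from a .txt file. My file contains sentences with words like

hadn’t, can’t, didn’t

and so on. The problem is that when I use the read() method, instead of

’

I get something like this:

â€™

So the word I read is hadnâ€™t instead of hadn’t.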

My input:

Love at First Sight

One <adjective> afternoon, I was walking by the <place> when
accidentally I bumped into a <adjective> boy.
At first I blushed and apologized for bumping into him, but when he flashed his
<adjective> smile I just couldn’t help falling in love. His
<adjective> voice telling me that it was ok sounded like music to my ears.
I could have stayed there staring at him for <period_of_time>.
He had <adjective> <color> eyes and <adjective>
<color> hair. I thought he was perfect for me. Before I noticed,
<number> <period_of_time> had passed by after I apologized,
and I hadn’t said anything else since!
That’s when I noticed that he was looking at me
<adverb>. I didn’t know what to say, so I just <past_verb>.
I noticed him giving me a strange look when he started walking to his
<noun>. I looked back at him <number> more time(s), but he was already out of sight.
It wasn’t love after all

Expected output: identical to the input file

My code:

f = open('loveatfirstsight.txt','r')
for i in f.readlines():
    print(i)

My OS: Windows 10

Answer 1

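This sounds like an encoding problem. The text file is stored as UTF-8 and contains curly quotes. Either you are reading it with the wrong encoding (perhaps Latin-1), or you are writing it out as UTF-8 to something that does not expect UTF-8 (the Windows console, maybe?).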

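You'll get better answers if you edit the question to include more detail about how the data is stored, read, and processed, including which system and which Python version you are using.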
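To narrow down which wrong encoding is involved, it can help to decode the UTF-8 bytes of the curly apostrophe with each candidate and compare. This is only a rough sketch; the variable names are made up for illustration. Latin-1 would turn the last two bytes into invisible control characters, whereas cp1252 produces the visible â€™ reported in the question, so cp1252 looks like the more likely culprit here.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK is the curly apostrophe stored in the file
raw = '\u2019'.encode('utf-8')

# Decode the same three bytes with two candidate "wrong" encodings
as_cp1252 = raw.decode('cp1252')
as_latin1 = raw.decode('latin-1')

print(raw)               # the three UTF-8 bytes
print(as_cp1252)         # three visible characters
print(ascii(as_latin1))  # one visible character plus two C1 control characters
```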

Answer 2

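The file is encoded in UTF-8, but you are reading it as if it were (I assume) windows-1252, or some other Windows-specific encoding. The apostrophe in this file is not the typical ASCII "typewriter apostrophe" (' U+0027 APOSTROPHE) but a "typographer's apostrophe" (’ U+2019 RIGHT SINGLE QUOTATION MARK), which lies outside the Basic Latin ('ASCII') block. UTF-8 stores it as three bytes, and the mismatched encoding decodes each byte as a separate character, so the apostrophe comes out garbled.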

>>> 'hadn’t'.encode('utf-8').decode('cp1252')
'hadnâ€™t'
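As a side note, the same round trip run in reverse can sometimes rescue text that has already been garbled. A minimal sketch, assuming the damage really came from decoding UTF-8 as cp1252; `fix_mojibake` is a hypothetical helper name, not a standard-library function. It is not a general cure: Python's cp1252 codec leaves a few bytes undefined, so some mis-decoded text cannot be recovered this way and the call may raise an exception.

```python
def fix_mojibake(text, wrong='cp1252', right='utf-8'):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(right)

# Reproduce the damage from the question, then reverse it
garbled = 'hadn\u2019t'.encode('utf-8').decode('cp1252')
repaired = fix_mojibake(garbled)

print(garbled)
print(repaired)
```

Reading the file with the right encoding in the first place is still the cleaner fix.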

To fix the problem at the source, pass the correct encoding to the open function through its encoding parameter.

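# An explicit encoding makes the result independent of the Windows locale
with open('loveatfirstsight.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Each line already ends with '\n', so suppress print's own newline
        print(line, end='')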

As help(open) explains,

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
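To see what that default is on a given machine, and what it does to a UTF-8 file, something like the following can be run. It is a self-contained sketch: the temporary file and its name are made up for the demonstration, and cp1252 is passed explicitly to simulate what I assume is the default on a Western-European Windows installation.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given (platform dependent)
print(locale.getpreferredencoding(False))

# Write a small UTF-8 file containing the curly apostrophe U+2019
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('I hadn\u2019t said anything else since!\n')

# Simulate the mismatched default by asking for cp1252 explicitly
with open(path, encoding='cp1252') as f:
    wrong = f.read()

# Read the same file again with the encoding it was actually written in
with open(path, encoding='utf-8') as f:
    right = f.read()

print(wrong)  # apostrophe garbled
print(right)  # apostrophe intact
```

If the first line printed on your machine is cp1252 or similar, that would explain why a plain open(path, 'r') garbles the file.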

python

character-encoding