Apostrophes come out garbled when read from a .txt file

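I'm having trouble reading lines from a .txt file. My file contains sentences with words such as

hadn’t, can’t, didn’t

and so on. The problem is that when I use the read() method, I get something like this:

’

So the word I read is hadn’t instead of hadn’t.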

My input:

Love at First Sight

One <adjective> afternoon, I was walking by the <place> when
accidentally I bumped into a <adjective> boy.
At first I blushed and apologized for bumping into him, but when he flashed his
<adjective> smile I just couldn’t help falling in love. His
<adjective> voice telling me that it was ok sounded like music to my ears.
I could have stayed there staring at him for <period_of_time>.
He had <adjective> <color> eyes and <adjective>
<color> hair. I thought he was perfect for me. Before I noticed,
<number> <period_of_time> had passed by after I apologized,
and I hadn’t said anything else since!
That’s when I noticed that he was looking at me
<adverb>. I didn’t know what to say, so I just <past_verb>.
I noticed him giving me a strange look when he started walking to his
<noun>. I looked back at him <number> more time(s), but he was already out of sight.
It wasn’t love after all

Expected output: the same as the input file.

My code:

f = open('loveatfirstsight.txt','r')
for i in f.readlines():
    print(i)

My OS: Windows 10

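This sounds like an encoding problem. The text file is stored as UTF-8 and contains curly quotes. Either you are reading it with the wrong encoding (probably Latin-1), or you are writing it out as UTF-8 to something that doesn't expect UTF-8 (the Windows console, perhaps?).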
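To see which of the two cases applies, it can help to print the encodings Python is actually using. This is only a rough diagnostic sketch: report_encodings is a made-up helper name, and it shows what Python believes about the terminal, which is not necessarily how the terminal really interprets the bytes.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings involved in reading a file and printing it."""
    return {
        # What open() falls back to in text mode when no encoding is passed.
        "open_default": locale.getpreferredencoding(False),
        # What print() uses to encode text for the terminal (may be None
        # if stdout has been replaced by a non-standard object).
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in report_encodings().items():
    print(name, "=", value)
```

If open_default is something like cp1252 while the file is UTF-8, the problem is on the reading side.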

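If you edit the question to include more details about how the data is stored, read, and processed, including things like the system and the Python version you are using, you'll get better answers.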

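The file is encoded in UTF-8, but you are reading it as if it were (I assume) windows-1252 (or some other Windows-specific encoding). The apostrophe character in this file is not the typical ASCII "typewriter apostrophe" (' U+0027 APOSTROPHE) but a "typographer's apostrophe" (’ U+2019 RIGHT SINGLE QUOTATION MARK), which lies outside the Basic Latin ('ASCII') block, so the mismatched encodings make the character appear corrupted.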

>>> 'hadn’t'.encode('utf-8').decode('cp1252')
'hadn’t'
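The same mismatch can often be undone on text that has already been garbled, by encoding it back to the bytes it came from and decoding those as UTF-8. A small sketch, assuming the wrong decode really was cp1252 (the variable names are just for illustration); ascii() is used so the output does not depend on the console encoding:

```python
original = "hadn\u2019t"  # U+2019 RIGHT SINGLE QUOTATION MARK

# Reproduce the damage: UTF-8 bytes misread as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# Undo it: recover the original bytes, then decode them correctly.
repaired = garbled.encode("cp1252").decode("utf-8")

print(ascii(garbled))        # 'hadn\xe2\u20ac\u2122t'
print(ascii(repaired))       # 'hadn\u2019t'
print(repaired == original)  # True
```

This round trip is not guaranteed for every character, since cp1252 leaves a few byte values undefined, so fixing the encoding at read time is the better option when you still have the file.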

To correct this, specify the correct encoding through the encoding parameter of the open function.

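# Open the file as UTF-8 explicitly; the with block closes it afterwards.
with open('loveatfirstsight.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Each line keeps its trailing newline, so suppress print's own.
        print(line, end='')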

As help(open) explains:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
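For anyone who wants to reproduce this without the original file, here is a self-contained sketch. It writes a throwaway temporary file (not loveatfirstsight.txt) as UTF-8 and reads it back twice; cp1252 is my assumption for the encoding a Windows 10 machine would pick by default.

```python
import os
import tempfile

text = "hadn\u2019t"  # contains U+2019 RIGHT SINGLE QUOTATION MARK

# Write the sample as UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Reading with the wrong encoding splits the 3-byte sequence into 3 characters.
    with open(path, "r", encoding="cp1252") as f:
        wrong = f.read()
    # Reading with the encoding the file was written in gives the text back.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(path)

# ascii() keeps the output printable regardless of the console encoding.
print(ascii(wrong))
print(ascii(right))
```

Passing encoding explicitly makes the result the same on every platform, instead of depending on the locale.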