Apostrophes come out garbled when read from a .txt file
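I have a problem reading lines from a .txt file. My file contains sentences with words like hadn’t, can’t, didn’t, and so on. The problem is that when I use the read() method, instead of ’ I get something like ’, so the word I read is hadn’t instead of hadn’t.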
My input:
Love at First Sight
One <adjective> afternoon, I was walking by the <place> when
accidentally I bumped into a <adjective> boy.
At first I blushed and apologized for bumping into him, but when he flashed his
<adjective> smile I just couldn’t help falling in love. His
<adjective> voice telling me that it was ok sounded like music to myears.
I could have stayed there staring at him for <period_of_time>.
He had <adjective> <color> eyes and <adjective>
<color> hair. I thought he was perfect for me. Before I noticed,
<number> <period_of_time> had passed by after I apologized,
and I hadn’t said anything else since!
That’s when I noticed thathe was looking at me
<adverb>. I didn’t know what tosay, so I just <past_verb>.
I noticed him giving me astrange look when he started walking to his
<noun>.I looked back at him <number> more time(s), but hewas already out of sight.
It wasn’t love after all
Expected output: identical to the input file.
My code:
f = open('loveatfirstsight.txt','r')
for i in f.readlines():
    print(i)
My OS: Windows 10
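This sounds like an encoding issue. The text file is stored as UTF-8 and contains curly quotes. Either you are reading it with the wrong encoding (Latin-1, maybe), or you are writing it out as UTF-8 to something that does not expect UTF-8 (the Windows console, maybe?).
You will get better answers if you edit the question to include more detail about how the data is stored, read, and processed, including which system and which Python version you are using.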
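The file is encoded in UTF-8, but you are reading it as if it were (I assume) windows-1252, or some other Windows-specific encoding. The apostrophe in this file is not the typical ASCII "typewriter apostrophe" (' U+0027 APOSTROPHE) but a "typographer's apostrophe" (’ U+2019 RIGHT SINGLE QUOTATION MARK), which lies outside the Basic Latin ("ASCII") block, so the mismatched encoding makes the character appear corrupted.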
>>> 'hadn’t'.encode('utf-8').decode('cp1252')
'hadn’t'
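To make the mismatch concrete, here is a minimal sketch that shows the three UTF-8 bytes behind U+2019 and how cp1252 turns each of them into a separate character. The last step reverses the damage, but only as an illustration: it works for strings garbled in exactly this way (UTF-8 bytes decoded as cp1252) and is not a general repair. The variable names are made up for this example.

```python
# Illustration only: reproduce the mojibake, then undo it.
text = 'hadn\u2019t'                # U+2019 RIGHT SINGLE QUOTATION MARK

raw = text.encode('utf-8')
print(raw)                          # b'hadn\xe2\x80\x99t'

# Wrong codec: each of the three bytes becomes its own character
garbled = raw.decode('cp1252')
print(garbled)

# Reverse the mistake; valid only if the text was garbled exactly this way
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired == text)             # True
```

Fixing the encoding at the point where the file is opened is still the better approach; the round trip is only useful for inspecting text that has already been damaged.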
To correct this, pass the right encoding to the open function through its encoding parameter.
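# Pass the encoding explicitly instead of relying on the platform default
with open('loveatfirstsight.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line, end='')  # each line already ends with a newline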
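The effect of the encoding argument can be checked without the original file. The sketch below writes a one-line sample (a stand-in for loveatfirstsight.txt, so the path and variable names are assumptions) as UTF-8, then reads it back twice: once with the matching encoding and once with cp1252 forced, to mimic what a Windows default might do.

```python
import os
import tempfile

# Hypothetical sample standing in for a line of loveatfirstsight.txt
sample = 'and I hadn\u2019t said anything else since!\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)

    # Matching encoding: the text comes back unchanged
    with open(path, 'r', encoding='utf-8') as f:
        good = f.read()

    # Mismatched encoding, forced here to mimic a cp1252 default
    with open(path, 'r', encoding='cp1252') as f:
        bad = f.read()

print(good == sample)   # True
print(bad == sample)    # False
```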
As help(open) explains:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
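To see which default applies on a given machine, the call quoted above can be run directly. This is a small diagnostic sketch rather than part of the fix. On many Windows installations the first value is cp1252, though that depends on the system locale and on how Python is configured. The second value is the encoding print() uses when writing to the terminal, which is the other place the comments suggest a mismatch could occur.

```python
import locale
import sys

# Encoding that open() falls back to in text mode when none is given
default_encoding = locale.getpreferredencoding(False)
print('default for open():', default_encoding)

# Encoding used when print() writes to the terminal
print('stdout encoding:', sys.stdout.encoding)
```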