Python: reading a string from a file with a strange encoding

Question

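I made a pig latin translator that takes the user's input, translates it, and returns it. I want to add the ability to read the text from an input file, but I'm running into a problem: the file doesn't open as expected. Here is my code: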

from sys import argv
script, filename = argv

file = open(filename, "r")

sentence = file.read()

print sentence

file.close()

The problem is that when I print out the contents of the file, it looks like this:

■T h i s   i s   s o m e   t e x t   i n   a   f i l e

instead of this:

This is some text in a file

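I know I could work around the spaces and the odd square character with slicing, but I feel that treats the symptom rather than the cause. I want to understand why the text comes out formatted so strangely so that I can fix the actual problem.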

Answer 1

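I believe this is a UTF-16 encoded file, and the square character at the start is the Unicode byte order mark (BOM). It could also be another encoding that uses a byte order mark, but it definitely looks like a multi-byte encoding.

This is also why you see blanks between the characters. UTF-16 effectively represents each character as two bytes, but for standard ASCII characters like the ones you are using, the other half of the character is empty (in little-endian order, the second byte is 0).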
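A minimal sketch of the byte layout described above. It assumes the file is little-endian UTF-16 with a BOM and that the console uses code page 437; neither is stated in the question, so treat both as assumptions. The variable names are illustrative only.

```python
import codecs

text = u"This"

# Assumption: little-endian UTF-16 preceded by a BOM.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The raw bytes: two BOM bytes, then each ASCII letter followed by a zero byte.
print(repr(raw))

# Assumption: the console interprets bytes as code page 437. There, 0xFF is a
# non-breaking space and 0xFE is the black square; the zero bytes stay NUL
# characters, which a console typically shows as blanks.
as_console = raw.decode("cp437")
print(repr(as_console))

# Decoding with the right codec consumes the BOM and restores the text.
decoded = raw.decode("utf-16")
print(decoded)
```

Under those assumptions, the square and the gaps in the output are the BOM and the zero bytes being shown as if they were single-byte characters.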

Try this:

from sys import argv
import codecs
script, filename = argv

file = codecs.open(filename, encoding='utf-16')
sentence = file.read()
print sentence
file.close()

Replace encoding='utf-16' with the file's actual encoding. You may simply have to try a few and experiment.
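Instead of pure trial and error, the BOM itself can narrow the choice. The sketch below is one possible approach, not a complete detector: guess_encoding_from_bom and read_text are hypothetical helper names, and the fallback to UTF-8 for files without a BOM is an assumption that will be wrong for some files.

```python
import codecs

# The UTF-32 marks are checked first because the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(raw, default="utf-8"):
    """Return a codec name based on the leading BOM, or `default` if none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default  # assumption: files without a BOM are UTF-8


def read_text(path):
    """Read a file as bytes and decode it with the guessed codec."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(guess_encoding_from_bom(raw))
```

The "utf-16", "utf-32", and "utf-8-sig" codecs all strip the BOM while decoding, so the returned text does not start with a stray marker character.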

Answer 2

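Well, the most likely explanation is that your code is reading the data from the file correctly.

As for why the output looks strange, there could be many reasons.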

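But it looks like you are using Python 2 (the print statement), and since the text appears as

character, space, character, space

I assume the file you are reading is Unicode-encoded text, most likely UTF-16, so ABC is written as \u0041\u0042\u0043, with each character taking two bytes.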

Either decode the byte string into a Unicode string, or use Python 3 and read up on how it handles Unicode.
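For the Python 3 route, the built-in open accepts an encoding argument directly. The following is a small self-contained sketch; the temporary file and the name sample.txt are made up for the example, and it assumes the file really is UTF-16.

```python
import os
import tempfile

sentence = "This is some text in a file"

# Hypothetical sample file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Writing with encoding="utf-16" puts a BOM at the start of the file.
with open(path, "w", encoding="utf-16") as f:
    f.write(sentence)

# Binary mode shows what is actually on disk: a BOM plus two bytes per character.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with the matching encoding returns the original string.
with open(path, "r", encoding="utf-16") as f:
    text = f.read()

print(len(raw), len(text))
print(text)
```

In Python 3, read() in text mode returns a str that has already been decoded, so no separate decode step is needed.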

Answer 3

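The original file is UTF-16. Here is an example that writes a UTF-16 file and then reads it back with both open and io.open; the latter takes an encoding parameter: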

#!python2
import io

sentence = u'This is some text in a file'

with io.open('file.txt','w',encoding='utf16') as f:
    f.write(sentence)

with open('file.txt') as f:
    print f.read()

with io.open('file.txt','r',encoding='utf16') as f:
    print f.read()

Output on a US Windows 7 console:

 ■T h i s   i s   s o m e   t e x t   i n   a   f i l e
This is some text in a file

My guess is that the OP created the text file in Windows Notepad and saved it as "Unicode", which is Microsoft's misnomer for the UTF-16 encoding.

Answer 4

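At first, when I saw everyone replying with things about Unicode and UTF, I shied away from reading it and trying to fix it. But I'm determined to learn programming in Python, so I did some research, mainly this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

It was really helpful. From what I can gather, the Notepad++ I used to write the text file wrote it in UTF-8, and Python read it as UTF-16. The solution was to import codecs and use the codecs.open function like this (as Will said above):

from sys import argv
import codecs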

script, filename = argv

file = codecs.open(filename, encoding = "utf-8")

sentence = file.read()

print sentence

file.close()

python

string

character-encoding

file-handling