Read a byte-represented file in Python as UTF-8 characters

Question

I have a .txt file that was generated by a built-in tool on Windows, and I need to use it in a Python script (on a Linux machine, if that is relevant).

I open the file like this:

with open(path, 'r') as spec_file:

I even tried the io library:

with io.open(detail, mode="r", encoding="utf-8") as spec_file:

When the file is opened in (for example) Sublime Text, it displays correctly, and when I iterate over the file line by line:

for line in spec_file:

and print each line (print(line)), I also get the correct representation:

**********************************************************************************
* This diagnostic information may be used by an IT administrator to troubleshoot *
* the installed Trusted Platform Module (TPM). Please zip the folder and attach  *
* it to issues filed through Feedback Hub or with an IT admin.                   *
**********************************************************************************

However, when I print with print(repr(line)), I only get the byte representation of the characters:

'*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00\n'
'\x00\n'
'\x00*\x00 \x00T\x00h\x00i\x00s\x00 \x00d\x00i\x00a\x00g\x00n\x00o\x00s\x00t\x00i\x00c\x00 \x00i\x00n\x00f\x00o\x00r\x00m\x00a\x00t\x00i\x00o\x00n\x00 \x00m\x00a\x00y\x00 \x00b\x00e\x00 \x00u\x00s\x00e\x00d\x00 \x00b\x00y\x00 \x00a\x00n\x00 \x00I\x00T\x00 \x00a\x00d\x00m\x00i\x00n\x00i\x00s\x00t\x00r\x00a\x00t\x00o\x00r\x00 \x00t\x00o\x00 \x00t\x00r\x00o\x00u\x00b\x00l\x00e\x00s\x00h\x00o\x00o\x00t\x00 \x00*\x00\n'
'\x00\n'
'\x00*\x00 \x00t\x00h\x00e\x00 \x00i\x00n\x00s\x00t\x00a\x00l\x00l\x00e\x00d\x00 \x00T\x00r\x00u\x00s\x00t\x00e\x00d\x00 \x00P\x00l\x00a\x00t\x00f\x00o\x00r\x00m\x00 \x00M\x00o\x00d\x00u\x00l\x00e\x00 \x00(\x00T\x00P\x00M\x00)\x00.\x00 \x00P\x00l\x00e\x00a\x00s\x00e\x00 \x00z\x00i\x00p\x00 \x00t\x00h\x00e\x00 \x00f\x00o\x00l\x00d\x00e\x00r\x00 \x00a\x00n\x00d\x00 \x00a\x00t\x00t\x00a\x00c\x00h\x00 \x00 \x00*\x00\n'
'\x00\n'
'\x00*\x00 \x00i\x00t\x00 \x00t\x00o\x00 \x00i\x00s\x00s\x00u\x00e\x00s\x00 \x00f\x00i\x00l\x00e\x00d\x00 \x00t\x00h\x00r\x00o\x00u\x00g\x00h\x00 \x00F\x00e\x00e\x00d\x00b\x00a\x00c\x00k\x00 \x00H\x00u\x00b\x00 \x00o\x00r\x00 \x00w\x00i\x00t\x00h\x00 \x00a\x00n\x00 \x00I\x00T\x00 \x00a\x00d\x00m\x00i\x00n\x00.\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00*\x00\n'
'\x00\n'
'\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00\n'

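Because of this I cannot search the file or work with its contents as strings, so I need to convert it somehow into a UTF-8 string. Any idea how to do that?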

Answer 1

Your file is encoded in UTF-16 LE (because Windows; see this question for more information), so you need to specify that as the encoding:

with open(path, 'r', encoding="utf-16le") as spec_file:

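LE stands for Little Endian. This matters because the plain "utf-16" codec looks for a byte order mark to determine the byte order, and the Windows tool did not write one (again, because Windows), so you need to state the byte order explicitly.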
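To illustrate the difference, here is a minimal sketch. It assumes a small made-up sample file (the text, the temporary path, and the variable names are placeholders, not the asker's real data) written as UTF-16 LE without a byte order mark, and reads it once with the wrong codec and once with the right one:

```python
import os
import tempfile

# Build a small sample file encoded as UTF-16 LE without a byte order mark,
# mimicking the file from the question (the content here is made up).
sample_text = "* This diagnostic information *\n* TPM *\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample_text.encode("utf-16-le"))
    path = tmp.name

try:
    # Decoding as UTF-8 does not fail for ASCII-only text, but every
    # second character of the resulting string is a NUL ('\x00').
    with open(path, "r", encoding="utf-8", newline="") as spec_file:
        wrong = spec_file.read()

    # Decoding with the matching codec yields ordinary strings.
    with open(path, "r", encoding="utf-16-le") as spec_file:
        lines = spec_file.readlines()
finally:
    os.remove(path)

print(repr(wrong[:8]))    # '*\x00 \x00T\x00h\x00'
print(repr(lines[0]))     # '* This diagnostic information *\n'
print("TPM" in lines[1])  # True
```

Once decoded this way, the lines are normal Python str objects, so searching them works as expected, and they can be written back out with encoding="utf-8" if a UTF-8 file is needed. Note that if a file does start with a byte order mark, "utf-16-le" leaves it in the text as a leading U+FEFF character, whereas "utf-16" consumes it.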
