Replace iso-8859-1 encoding symbols in python open module

Question

How do I decode the ISO-8859-1 symbols I get when reading a file with open()?

filename = open(f'/opt/PATH/{shorter}', 'r', encoding='iso-8859-1')
file_content = filename.read()
filename.close()

This gives me ÿ (which I guess is a comma):

[...]
11 Dir(s) 3ÿ016ÿ011ÿ776 bytes free
[...]

Answer 1

Use ISO-8859-2 instead:

filename = open(f'/opt/PATH/{shorter}', 'r', encoding='iso-8859-2')

You can also try cp1250 or windows-1250. The Windows-1250 code page differs slightly from ISO-8859-2.

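ascii refers to the 7-bit US-ASCII code page. That code page cannot decode your file either, because the file contains byte values above 127.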
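As a small illustration of the point about ascii, the sketch below decodes a byte string shaped like the line from the question. The sample bytes and the helper name try_decode are assumptions made for this example, not part of the original post.

```python
# Sample bytes modelled on the question's output: 0xFF sits between digit groups.
raw = b"11 Dir(s) 3\xff016\xff011\xff776 bytes free"

def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

# ascii only covers byte values 0-127, so 0xFF makes it fail;
# iso-8859-1 accepts every byte value and never raises here.
for codec in ("ascii", "iso-8859-1"):
    print(codec, "->", try_decode(raw, codec))
```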

If you used dir in the cmd shell, switch to UTF-8 with chcp 65001 before running the script. Or use PowerShell Core instead.

Code pages

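As I explained in the comments, ÿ is not some kind of encoding symbol. Single-byte code pages such as Latin1 (aka ISO-8859-1), the Central/Eastern European code pages such as ISO-8859-2, the Cyrillic ones and so on simply encode each character as a single byte value. Encoding symbols only appear in Unicode and in markup languages such as HTML and XML.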

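The ÿ character you posted is encoded as 255 (0xFF) in ISO-8859-1. In the old IBM DOS code pages 437 and 852, that byte corresponds to a non-breaking space. In some other ISO-8859 code pages, including the Central/Eastern European ISO-8859-2, that value is used for a dot (˙).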
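The mapping described above can be checked with Python's built-in codecs. This is a minimal sketch; the sample bytes are an assumption modelled on the line in the question.

```python
# The same bytes, interpreted under four single-byte code pages.
raw = b"3\xff016\xff011\xff776 bytes free"

for codec in ("iso-8859-1", "iso-8859-2", "cp437", "cp852"):
    text = raw.decode(codec)
    # ascii() escapes non-ASCII characters, so the otherwise invisible
    # no-break space (U+00A0) shows up as \xa0.
    print(f"{codec:<11} {ascii(text)}")
```

Only the interpretation changes between the rows; the bytes in the file stay the same.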

What happened

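I suspect the file was created by redirecting the output of dir in the Windows cmd shell. dir in PowerShell doesn't have this footer. dir in cmd uses the user's (your) locale to format dates and numbers. This means that if someone uses custom formats, you may get different results. Linux shells allow this kind of localization and customization too.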

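The cmd shell is non-Unicode, so when you redirect output, the shell uses the current code page, which matches the user's locale, to encode the text as bytes. To use UTF-8, you have to change the shell's code page explicitly: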

chcp 65001
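If the listing is produced after switching the code page this way, the Python side would read it as UTF-8. The sketch below simulates that with a temporary file, assuming the redirected output really is UTF-8; the file name dir_utf8.txt and the sample line are hypothetical.

```python
import os
import tempfile

# Simulated footer line; in UTF-8 the no-break space is stored as bytes C2 A0.
sample = "11 Dir(s) 3\u00a0016\u00a0011\u00a0776 bytes free\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dir_utf8.txt")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(sample.encode("utf-8"))

    # Reading with the matching encoding yields the separator as U+00A0, not as ÿ.
    with open(path, "r", encoding="utf-8") as fh:
        file_content = fh.read()

print(ascii(file_content))
```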

Better options

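Windows Terminal and PowerShell use Unicode by default, just like Windows itself, and don't have this kind of problem. Redirection even lets you specify the encoding, process the results as objects or tables, or output the data as CSV or HTML. cmd is essentially a legacy shell.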

In PowerShell/PowerShell Core you can use:

Dir | Export-CSV C:\Users\username\Desktop\FileList.csv

to export the directory listing to a properly formatted UTF-8 CSV file.

Answer 2

This is a case of mojibake:

`cmd`

>NUL chcp 852
>dir_cp852.txt dir /C
type dir_cp852.txt | find /I "bytes free"

              28 Dir(s)  832 467 206 144 bytes free

>NUL chcp 1252
type dir_cp852.txt | find /I "bytes free"

              28 Dir(s)  832ÿ467ÿ206ÿ144 bytes free

Python

with open('dir_cp852.txt', 'r', encoding='iso-8859-1') as filename:
    file_content = filename.read()

print(file_content[-52:])

              28 Dir(s)  832ÿ467ÿ206ÿ144 bytes free

Solution:

with open('dir_cp852.txt', 'r', encoding='cp852') as filename:
    file_content = filename.read()

print(file_content[-52:])

              28 Dir(s)  832 467 206 144 bytes free

Note the value of file_content[-52:] (at the Python prompt):

'              28 Dir(s)  832\xa0467\xa0206\xa0144 bytes free\n'

The character shown as mojibake is \xa0 (U+00A0, No-Break Space), which has code 0xFF in code page 852 (and in several other MS-DOS code pages).
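If the text has already been read with the wrong codec and the file cannot be reopened, the string can be repaired in memory by reversing the wrong decode. This is a minimal sketch, assuming the wrong codec was iso-8859-1 and the right one is cp852, as in this answer; the helper name fix_mojibake is hypothetical.

```python
def fix_mojibake(text, wrong="iso-8859-1", right="cp852"):
    """Re-encode with the codec that was wrongly used, then decode correctly.

    This works for iso-8859-1 because it maps every byte value 0-255 to
    exactly one character, so encoding restores the original bytes.
    """
    return text.encode(wrong).decode(right)

garbled = "28 Dir(s)  832ÿ467ÿ206ÿ144 bytes free"
fixed = fix_mojibake(garbled)
print(ascii(fixed))
```

Opening the file with the correct encoding in the first place remains the simpler fix.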

Note the /C switch in dir /C above (it displays the thousands separator in file sizes). I have overridden the default with a globally defined set "DIRCMD=/-C".

The thousands separator for file sizes is defined under Control Panel\Clock and Region -> Region:
reg query "HKCU\Control Panel\International" /v sThousand
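Because the separator depends on this user setting, a parser is more robust if it does not expect one particular character. The sketch below pulls the free-byte count out of a footer line and drops every non-digit; the function name bytes_free is hypothetical, and it assumes the English footer text "Dir(s) ... bytes free" shown above, which differs on localized Windows installations.

```python
import re

def bytes_free(line):
    """Return the free-byte count from a dir footer line, or None if absent."""
    match = re.search(r"Dir\(s\)(.*?)bytes free", line)
    if match is None:
        return None
    # Drop spaces, no-break spaces, commas, dots or mojibake between digit groups.
    digits = re.sub(r"\D", "", match.group(1))
    return int(digits) if digits else None

print(bytes_free("  28 Dir(s)  832\u00a0467\u00a0206\u00a0144 bytes free\n"))
print(bytes_free("11 Dir(s) 3,016,011,776 bytes free"))
```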
