Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file
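Regarding a solution for determining whether a file is binary or text in Python, the answerer uses:
textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))
and then uses .translate(None, textchars)
to remove (or replace with nothing) all such characters from a file read in as binary.
The answerer also argues that this choice of numbers is "based on file(1) behaviour" (for what counts as text and what does not). What is so significant about these numbers that they distinguish text files from binary ones?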
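They represent the most common code points for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards such as Latin-1 or Windows Codepage 1251 use the remaining 128 bytes for accented characters and so on.
You'd expect text to use only those code points. Binary data would use all code points in the range 0x00-0xFF; a text file, for example, will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).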
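As a minimal sketch of how the check from the question plays out in Python 3: deleting every byte listed in textchars leaves only the bytes the heuristic considers non-text, so a non-empty remainder suggests binary data. The function name is_binary_string and the sample inputs are illustrative assumptions, not taken from the linked solution.

```python
# Bytes treated as "text": seven control characters plus everything from 0x20 to 0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if data contains any byte outside textchars (hypothetical helper)."""
    # translate(None, delete) removes every byte that appears in `delete`;
    # whatever survives is a byte the heuristic considers non-text.
    return bool(data.translate(None, textchars))


print(is_binary_string(b"plain text\r\n"))      # False
print(is_binary_string(b"\x00\x01binary\x1f"))  # True
```

In practice you would pass in the first chunk of a file opened in 'rb' mode rather than a literal.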
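It is a heuristic at best, however. Some text files may still use C0 control codes outside those 7 explicitly named characters, and I'm sure binary data exists that happens not to include any of the 25 byte values missing from the textchars string.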
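To make both failure modes concrete, here is a small sketch (the sample byte strings are made up for illustration). It lists the excluded byte values, then shows a text-like input that gets flagged and a non-text input that slips through.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# The byte values the heuristic treats as non-text: the C0 controls
# other than the seven that are explicitly allowed.
excluded = sorted(set(range(0x100)) - set(textchars))
print(len(excluded))  # 25

# Text that uses a C0 control outside the allowed seven (vertical tab, 0x0B)
# is reported as binary.
looks_binary = bool(b"col1\x0bcol2\n".translate(None, textchars))

# Data made up only of high bytes contains none of the excluded values,
# so it passes as text even though it need not be readable in any encoding.
passes_as_text = not bytes(range(0x80, 0x100)).translate(None, textchars)

print(looks_binary, passes_as_text)  # True True
```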
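The author of the range probably based it on the text_chars table from the file command. It marks bytes as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those code points were chosen: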
/*
* This table reflects a particular philosophy about what constitutes
* "text," and there is room for disagreement about it.
*
* [....]
*
* The table below considers a file to be ASCII if all of its characters
* are either ASCII printing characters (again, according to the X3.4
* standard, not isascii()) or any of the following controls: bell,
* backspace, tab, line feed, form feed, carriage return, esc, nextline.
*
* I include bell because some programs (particularly shell scripts)
* use it literally, even though it is rare in normal text. I exclude
* vertical tab because it never seems to be used in real text. I also
* include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
* because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
* character to. It might be more appropriate to include it in the 8859
* set instead of the ASCII set, but it's got to be included in *something*
* we recognize or EBCDIC files aren't going to be considered textual.
*
* [.....]
*/
Interestingly, that table excludes 0x7F (DEL), which the code you found does not.
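If you wanted the Python check to treat 0x7F the way the table is described as doing, one possible adjustment is to drop that byte from the allowed set. This is only a sketch; strict_textchars is a hypothetical name, and whether rejecting DEL is desirable depends on your data.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Hypothetical stricter variant: the same set without DEL (0x7F).
strict_textchars = bytearray(b for b in textchars if b != 0x7F)

sample = b"abc\x7fdef"
print(bool(sample.translate(None, textchars)))         # False: DEL is accepted as text
print(bool(sample.translate(None, strict_textchars)))  # True: DEL is now flagged
```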