Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file

Regarding a solution for determining whether a file is binary or text in Python, the answerer uses:

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))

and then uses .translate(None, textchars) to remove (or replace with nothing) all such characters from a file read in as binary.
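A minimal sketch of how that check might be wrapped up. The function names is_binary_string and is_binary_file and the 1024-byte sample size are assumptions made for illustration, not something the text above specifies:

```python
# Byte values treated as "text": seven control characters plus 0x20-0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if data contains any byte that is not in textchars."""
    # translate(None, textchars) deletes every byte listed in textchars,
    # so whatever is left over is a "non-text" byte.
    return bool(data.translate(None, textchars))


def is_binary_file(path: str, sample_size: int = 1024) -> bool:
    """Apply the heuristic to the first sample_size bytes of a file.

    sample_size is an arbitrary choice for this sketch.
    """
    with open(path, "rb") as f:
        return is_binary_string(f.read(sample_size))
```

The file must be opened in "rb" mode so that the check sees raw bytes rather than decoded text.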

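The answerer also argues that this choice of numbers is "based on file(1) behaviour" (for what is text and what is not). What is so significant about these numbers for telling text files apart from binary files?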

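They represent the most common codepoints for printable text, plus newlines, spaces, carriage returns, etc. ASCII is covered up to 0x7F, and standards like Latin-1 or Windows Codepage 1251 use the remaining 128 bytes for accented characters, etc.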

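You'd expect text to use only those codepoints. Binary data would use all codepoints in the range 0x00-0xFF; a text file, for example, will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).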

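It is a heuristic at best, however. Some text files may still use C0 control codes other than the 7 characters explicitly named, and I'm sure binary data exists that happens not to contain any of the 25 byte values missing from the textchars string.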
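To make the count of 25 and the limits of the heuristic concrete, here is a small sketch. The sample inputs are made up for illustration:

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Byte values the heuristic does NOT consider text.
non_text = sorted(set(range(0x100)) - set(textchars))
print(len(non_text))  # 25
print([hex(b) for b in non_text])

# A text file encoded as UTF-16 contains NUL bytes, so it is flagged as binary.
utf16_text = "hi".encode("utf-16-le")
print(bool(utf16_text.translate(None, textchars)))  # True

# Arbitrary data that happens to avoid the 25 excluded values passes as text.
lucky_binary = bytes(range(0x80, 0x100))
print(bool(lucky_binary.translate(None, textchars)))  # False
```

Both misclassifications follow directly from the byte set: the check only looks at which byte values occur, not at whether they form valid text in any encoding.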

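The author of that range probably based it on the text_chars table from the file command. It marks bytes as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those codepoints were picked: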

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

Interestingly enough, that table excludes 0x7F, which the code you found does not.
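The difference is easy to check. The sketch below also builds a variant that drops 0x7F (DEL); the name textchars_no_del is hypothetical, and this is only one possible way to bring the set closer to the table, not a reproduction of what file(1) does:

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# range(0x20, 0x100) includes 0x7F, so DEL counts as text in the snippet.
print(0x7F in textchars)  # True

# Hypothetical variant that treats DEL as a non-text byte.
textchars_no_del = bytearray(b for b in textchars if b != 0x7F)
print(0x7F in textchars_no_del)  # False

# With the variant, data containing DEL is reported as binary.
print(bool(b"abc\x7f".translate(None, textchars_no_del)))  # True
```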