Python: convert a binary file into a string while ignoring non-ASCII characters

I have a binary file and I want to extract all the ASCII characters from it while ignoring the non-ASCII ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close

But I get an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How do I get Python to ignore non-ASCII characters?

Use the built-in ASCII codec and tell it to ignore any errors, for example:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text.encode('ascii', 'ignore')))
   file.close()
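As a minimal sketch of the same idea wrapped in a function: the name write_ascii_only, its parameters, and the throwaway file names are illustrative assumptions, not part of the original answer. It encodes explicitly and writes bytes in binary mode, so no implicit codec is involved, and it should behave the same on Python 2 and Python 3.

```python
import os
import tempfile


def write_ascii_only(src_path, dst_path, source_encoding="utf-16-le"):
    """Copy src_path to dst_path, keeping only the ASCII characters."""
    with open(src_path, "rb") as src:
        # Passing "ignore" here also skips byte sequences that are not
        # valid in source_encoding, which arbitrary binary data may contain.
        text = src.read().decode(source_encoding, "ignore")
    # Encode explicitly and write bytes, so nothing is encoded implicitly.
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("ascii", "ignore"))


# Small demonstration with a throwaway input file (hypothetical names).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "input.bin")
dst_path = os.path.join(tmp_dir, "text.txt")
with open(src_path, "wb") as f:
    f.write(u"hello \u00a0 there".encode("utf-16-le"))

write_ascii_only(src_path, dst_path)

with open(dst_path, "rb") as f:
    result = f.read()
```

Using with for both files also guarantees they are closed, even if decoding or writing raises.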

You can test this and experiment in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Simply trying to convert it to a string raises an exception.

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does trying to encode that unicode string to ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it cannot handle works fine:

>>> s.encode('ascii', 'ignore')
'hello  there'
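If silently dropping characters is too lossy, the same second argument to encode accepts other standard error handlers. The short comparison below is only a sketch and the variable names are illustrative:

```python
s = u"hello \u00a0 there"

# 'ignore' silently drops every character that ASCII cannot represent.
ignored = s.encode("ascii", "ignore")

# 'replace' substitutes a question mark, so the gap stays visible.
replaced = s.encode("ascii", "replace")

# 'xmlcharrefreplace' writes a numeric character reference instead.
xml_refs = s.encode("ascii", "xmlcharrefreplace")

print(repr(ignored))
print(repr(replaced))
print(repr(xml_refs))
```

The three results are the byte strings "hello  there", "hello ? there", and "hello &#160; there"; the repr shown by print differs slightly between Python 2 and Python 3.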

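Basically, the ASCII table takes the values in [0, 2^7), that is 0 through 127, and associates them with characters (printable or not). So, to ignore non-ASCII characters, you just have to ignore the characters whose code is not in [0, 2^7), i.e. whose code is greater than 127.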

Python has a function called ord which, according to its docstring, does the following:

Return the integer ordinal of a one-character string.

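In other words, it gives you the code of a character. Now you have to ignore every character for which ord returns a value of 128 or more. This can be done as follows: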

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    out_file = open("text.txt", "w")

    # Check every single character of `text`
    for character in text:
        # If it's an ascii character
        if ord(character) < 128:
            out_file.write(character)

    # Call close(); a bare `out_file.close` does nothing
    out_file.close()
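The same filter can also be written as a small function that works on a string in memory, which makes it easy to try without a file. This is a sketch; keep_ascii and the sample text are hypothetical names chosen for illustration.

```python
def keep_ascii(text):
    """Return text with every character of code 128 or above removed."""
    return u"".join(ch for ch in text if ord(ch) < 128)


# ord gives the code of a single character.
code_of_a = ord(u"A")          # an ASCII letter
code_of_nbsp = ord(u"\u00a0")  # the non-breaking space from the question

sample = u"caf\u00e9 \u00a0 ok"
cleaned = keep_ascii(sample)
```

For this input the result matches what sample.encode('ascii', 'ignore') produces, so the explicit loop is mostly useful when you want to change the condition.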

Now, if you only want to keep printable characters, note that all of them, at least in the ASCII table, lie between 32 (space) and 126 (tilde), so you simply have to do:

if 32 <= ord(character) <= 126:
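Put together as a function, that condition might look like the sketch below. The name keep_printable_ascii and the extra parameter are assumptions added for illustration; extra lets you keep characters such as newlines, which fall outside the 32 to 126 range and would otherwise be dropped.

```python
def keep_printable_ascii(text, extra=u""):
    """Keep characters with codes 32..126, plus any listed in `extra`."""
    return u"".join(
        ch for ch in text
        if 32 <= ord(ch) <= 126 or ch in extra
    )


sample = u"a\tb\n\u00a0c~\x7f"
printable_only = keep_printable_ascii(sample)
with_newlines = keep_printable_ascii(sample, extra=u"\n")
```

Here the tab, the newline, the non-breaking space, and DEL (code 127) are removed in the first call, while the second call keeps the newline.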