Python: convert a binary file to a string while ignoring non-ASCII characters

Question

I have a binary file and I want to extract all of its ASCII characters while ignoring the non-ASCII ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close()

But when writing to the file I get the error UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How do I make Python ignore non-ASCII characters?

Answer 1

Use the built-in ASCII codec and tell it to ignore any errors, for example:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text.encode('ascii', 'ignore')))
   file.close()

You can test and experiment in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

Simply trying to convert it to a str raises an exception:

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does trying to encode that unicode string as ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it can't handle works fine:

>>> s.encode('ascii', 'ignore')
'hello  there'
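
The session above is Python 2. As a rough sketch of the same idea in Python 3, assuming the input really is UTF-16-LE as in the question, the encode-and-ignore step can be wrapped in a small helper. The names `ascii_only` and `convert_file` are made up for illustration and are not part of the original answer.

```python
def ascii_only(text):
    """Return `text` with every non-ASCII character dropped."""
    # encode(..., 'ignore') silently skips characters ASCII cannot represent;
    # decoding the resulting bytes gives back a str.
    return text.encode('ascii', 'ignore').decode('ascii')


def convert_file(src_path, dst_path, src_encoding='utf-16-le'):
    """Hypothetical helper: decode a binary file and write only its ASCII text."""
    with open(src_path, 'rb') as src:
        text = src.read().decode(src_encoding)
    # An explicit output encoding avoids depending on the platform default.
    with open(dst_path, 'w', encoding='ascii') as dst:
        dst.write(ascii_only(text))
```

Using `with` for both files means they are closed even if decoding fails partway through.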

Answer 2

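Basically, the ASCII table assigns a character (printable or not) to each value in the range [0, 2⁷). So, to ignore non-ASCII characters, you just need to drop every character whose code is not in [0, 2⁷), that is, every character whose code is greater than 127.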

Python has a built-in function called ord; according to its docstring:

Return the integer ordinal of a one-character string.

In other words, it gives you the code of a character. So you need to skip every character for which ord returns 128 or more. This can be done as follows:

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    out_file = open("text.txt", "w")

    # Check every single character of `text`
    for character in text:
        # If it's an ascii character
        if ord(character) < 128:
            out_file.write(character)

    out_file.close()
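
The loop above can also be written as a single pass that builds the filtered string before writing it. This is a minimal sketch, assuming Python 3; `keep_ascii` is an illustrative name, not something from the original answer.

```python
def keep_ascii(text):
    """Keep only characters whose code point is in [0, 128)."""
    return ''.join(ch for ch in text if ord(ch) < 128)


print(keep_ascii('hello \u00a0 there'))  # prints "hello  there"
```

Building the string first means the output file gets one write call instead of one per character.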

Now, if you only want to keep printable characters, note that all of them, at least in the ASCII table, lie between 32 (space) and 126 (tilde), so you just need:

if 32 <= ord(character) <= 126:
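
Put into a complete function, the printable-only check might look like the sketch below (Python 3 assumed; `keep_printable` is a hypothetical name). Note that this range also drops tab and newline, codes 9 and 10, so line breaks in the input are removed as well.

```python
def keep_printable(text):
    """Keep only printable ASCII: space (32) through tilde (126)."""
    return ''.join(ch for ch in text if 32 <= ord(ch) <= 126)


sample = 'a\tb\nc~\x7f\u00a0'
print(repr(keep_printable(sample)))  # prints 'abc~'
```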
