Python: convert a binary file into a string while ignoring non-ASCII characters
I have a binary file and I want to extract all the ASCII characters while ignoring the non-ASCII ones. Currently I have:
with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    file = open("text.txt", "w")
    file.write("{}".format(text))
    file.close
But writing to the file fails with this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
How can I get Python to ignore the non-ASCII characters?
Use the built-in ASCII codec and tell it to ignore any errors, like so:
with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    file = open("text.txt", "w")
    file.write("{}".format(text.encode('ascii', 'ignore')))
    file.close()
You can test and experiment with this in the Python interpreter:
>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'
Simply trying to convert it to a string raises an exception.
>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)
...as does trying to encode that unicode string as ASCII:
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)
...but telling the codec to ignore the characters it cannot handle works fine:
>>> s.encode('ascii', 'ignore')
'hello  there'
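The session above is Python 2 (note the u'...' reprs). As a rough sketch of the same idea in Python 3, where encode returns bytes that have to be decoded back to str, something like the following should work. The function name bytes_to_ascii_text and the sample data are made up for illustration; the UTF-16-LE input encoding is carried over from the question.

```python
def bytes_to_ascii_text(data, source_encoding="utf-16-le"):
    """Decode raw bytes, then drop every character the ASCII codec cannot encode."""
    text = data.decode(source_encoding)
    # errors="ignore" makes the codec skip unencodable characters instead of raising
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample standing in for the contents of the binary file
sample = "hello \u00a0 there".encode("utf-16-le")
result = bytes_to_ascii_text(sample)
print(repr(result))
```

Writing result to a file opened in text mode then no longer depends on the default encoding, because only ASCII characters remain.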
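Basically, the ASCII table takes values in the range [0, 2^7), i.e. [0, 128), and associates them with characters (printable or not). So, to ignore non-ASCII characters, you just have to ignore the characters whose code is not in [0, 2^7), and keep those whose code is less than or equal to 127.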
Python has a function called ord which, according to its docstring, will:
Return the integer ordinal of a one-character string.
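In other words, it gives you the code of a character. Now you have to ignore every character for which ord returns a value of 128 or greater. This can be done as follows: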
with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    out_file = open("text.txt", "w")
    # Check every single character of `text`
    for character in text:
        # If it's an ascii character
        if ord(character) < 128:
            out_file.write(character)
    out_file.close()
Now, if you only want to keep printable characters, note that all of them, at least in the ASCII table, lie between 32 (space) and 126 (tilde), so you simply do:
if 32 <= ord(character) <= 126:
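Both filters can be folded into one small helper. This is only a sketch in Python 3; the name keep_ascii, its printable_only flag, and the sample string are assumptions for illustration and do not come from the answer above.

```python
def keep_ascii(text, printable_only=False):
    """Return text with non-ASCII (optionally also non-printable) characters removed."""
    if printable_only:
        # Printable ASCII runs from 32 (space) to 126 (tilde)
        return "".join(ch for ch in text if 32 <= ord(ch) <= 126)
    # Plain ASCII covers code points 0 through 127
    return "".join(ch for ch in text if ord(ch) < 128)


# Hypothetical sample: a tab, a non-breaking space and an accented letter
sample = "tab\there \u00a0 caf\u00e9~"
all_ascii = keep_ascii(sample)
printable = keep_ascii(sample, printable_only=True)
print(repr(all_ascii))
print(repr(printable))
```

Note that the printable range also drops tabs and newlines, which may or may not be what you want when the output is meant to stay line-oriented.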