How to convert a filename with invalid UTF-8 characters back to bytes?
How can I convert the output of os.listdir (a list of Unicode str) into a list of bytes? It must work even when a filename is not valid UTF-8, for example:
$ locale
LANG=
LANGUAGE=
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> open(b'\x80', 'w')
<_io.TextIOWrapper name=b'\x80' mode='w' encoding='UTF-8'>
>>> os.listdir('.')
['\udc80']
>>> import sys
>>> [fn.encode(sys.getfilesystemencoding()) for fn in os.listdir('.')]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
>>> [... for fn in os.listdir('.')]
[b'\x80']
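So what do I need to write in place of the ... above to make it work?
Note that renaming the file, using Python 2.x, or using ASCII-only filenames is not an option in this case. I'm not looking for workarounds; I'm looking for the code to put in place of the ... placeholder.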
Use an error handler; in this case, the surrogateescape error handler looks appropriate:
Value: 'surrogateescape'
Meaning: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)
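To make the quoted behaviour concrete, here is a minimal sketch of the round trip that needs no filesystem at all. The sample bytes and the variable names (raw, text, restored, strict_failed) are made up for illustration, and UTF-8 is passed explicitly rather than taken from the system's filesystem encoding.

```python
# Bytes that are not valid UTF-8: 0x80 and 0xFF can never start a sequence.
raw = b"report-\x80\xff.txt"

# Decoding with surrogateescape maps each undecodable byte 0xNN to U+DCNN.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# A strict encode refuses the lone surrogates (the error shown in the question).
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same handler turns the surrogates back into the bytes.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)
```

ascii() is used for printing because writing a string that contains lone surrogates to a strict UTF-8 stdout would itself raise UnicodeEncodeError.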
The os.fsencode() utility function does exactly this: it encodes to sys.getfilesystemencoding() using the surrogateescape error handler where that applies to your OS:
Encode filename to the filesystem encoding with 'surrogateescape' error handler, or 'strict' on Windows; return bytes unchanged.
In practice it uses 'strict' only when the filesystem encoding is mbcs (see the os module source), a codec that is only available on Windows.
Demo:
>>> import sys
>>> ld = ['\udc80']
>>> [fn.encode(sys.getfilesystemencoding()) for fn in ld]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
>>> [fn.encode(sys.getfilesystemencoding(), 'surrogateescape') for fn in ld]
[b'\x80']
>>> import os
>>> [os.fsencode(fn) for fn in ld]
[b'\x80']
>>> [os.fsencode(fn) for fn in os.listdir('.')]
[b'\x80']
There is also a corresponding os.fsdecode for converting in the opposite direction.
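As a rough sketch of how the two functions pair up, the helpers below convert a whole list in each direction. The names names_to_str and names_to_bytes are hypothetical, not part of the os module, and the round trip of b'\x80' is assumed to run on a POSIX system, where the filesystem codec uses surrogateescape; on Windows the decoding step may raise instead.

```python
import os


def names_to_str(names):
    """Decode each filename with the filesystem encoding (str passes through)."""
    return [os.fsdecode(name) for name in names]


def names_to_bytes(names):
    """Encode each filename with the filesystem encoding (bytes pass through)."""
    return [os.fsencode(name) for name in names]


# Illustrative input: one undecodable name and one ordinary name.
original = [b"\x80", b"plain.txt"]

as_str = names_to_str(original)
round_tripped = names_to_bytes(as_str)

print(round_tripped == original)
```

Both functions accept either str or bytes, so the helpers can be applied to a mixed list without checking types first.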
If you just want the filenames from os.listdir as bytes, it already has that option. From the docs:
path may be either of type str or of type bytes. If path is of type bytes, the filenames returned will also be of type bytes; in all other circumstances, they will be of type str.
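A small sketch of that option, assuming a throwaway temporary directory: the helper list_names_as_bytes is a hypothetical name, and the file uses a plain ASCII name only so that the example runs on filesystems that reject invalid UTF-8. On Linux, a name such as b'\x80' would come back the same way, untouched.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """List `directory` with a bytes path so the entries come back as bytes."""
    return os.listdir(os.fsencode(directory))


with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, "example.txt"), "w").close()
    names = list_names_as_bytes(tmp)

print(names)
```

Passing a bytes path avoids the decode-then-encode round trip entirely, which is why no error handler appears in this version.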