Decoding string to UTF-8 if string is already defined as r'string' instead of b'string'

Question

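I'm reading path strings from a file. The file contains strings like this one, in which special Unicode characters have been escaped: /WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4

I need to convert that string to: /WAY-ALPHA2019-Español-Episodio-01.mp4

Here's some code that demonstrates what I'm trying to do: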

>>> stringa = r'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'
>>> stringb = b'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'

>>> print(stringa)
/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4
>>> print(stringb)
b'/WAY-ALPHA2019-Espan\xcc\x83ol-Episodio-01.mp4'

>>> print(stringa.decode('utf8'))
Traceback (most recent call last):
  File "C:\Users\arlin\AppData\Local\Programs\Python\Python310-32\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

>>> print(stringb.decode('utf8'))
/WAY-ALPHA2019-Español-Episodio-01.mp4

Answer 1

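Try this:

import re
re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), stringa.encode('ascii')).decode('utf-8')

Explanation:

We use the regular expression rb'\\([0-7]{3})' (which matches a literal backslash \ followed by exactly 3 octal digits) and replace each occurrence by taking the three-digit code (match[1]), interpreting it as an octal number (int(_, 8)), and substituting a single byte (bytes([_])) for the original escape sequence.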
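The same substitution can be spelled out step by step. This is only a sketch: the helper name unescape_octal and the intermediate variable names are made up for illustration, and the sample string is the one from the question.

```python
import re

# Sample path from the question: the octal escapes are literal text here,
# because the r'' prefix stops Python from interpreting them.
raw = r'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'

def unescape_octal(match):
    # match[1] holds the three octal digits as bytes, e.g. b'314'
    code = int(match[1], 8)   # b'314' -> 204 (0xCC)
    return bytes([code])      # 204 -> the single byte b'\xcc'

as_bytes = raw.encode('ascii')                                  # str -> bytes
unescaped = re.sub(rb'\\([0-7]{3})', unescape_octal, as_bytes)  # escapes -> raw bytes
decoded = unescaped.decode('utf-8')                             # bytes -> str

print(unescaped)
print(decoded)
```

Note that [0-7]{3} also matches codes above \377, which do not fit in a single byte; for such input bytes([...]) would raise a ValueError.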

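We need to operate on bytes because the escape codes represent raw bytes, not Unicode characters. Only after these sequences have been "unescaped" can we decode the UTF-8 into a string.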
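To see why the bytes route matters, here is a small comparison. It is a sketch on a shortened sample string, and the names wrong and right are illustrative only. Substituting directly on the str turns each escape into its own code point, while substituting on bytes lets the two bytes be decoded together as one UTF-8 sequence.

```python
import re

raw = r'Espan\314\203ol'

# On str: each escape becomes a separate character (U+00CC and U+0083),
# which is not what the file meant.
wrong = re.sub(r'\\([0-7]{3})', lambda m: chr(int(m[1], 8)), raw)

# On bytes: 0xCC 0x83 are decoded together as UTF-8, giving U+0303
# (combining tilde), which renders on top of the preceding "n".
right = re.sub(
    rb'\\([0-7]{3})',
    lambda m: bytes([int(m[1], 8)]),
    raw.encode('ascii'),
).decode('utf-8')

print(ascii(wrong))
print(ascii(right))
```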

Answer 2

I figured it out.
@Jasmijn's code had a bug/typo in it. Here is the working code:
Update: in my case, old_string can contain non-ASCII (UTF-8) characters, so I had to change .encode('ascii') to .encode('utf-8'), which still works for me.

import re
new_string = re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), old_string.encode('utf-8')).decode('utf-8')
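A quick check of the update, as a sketch: the function name decode_octal_escapes and the sample value are hypothetical, chosen so that the string mixes a real non-ASCII character with an octal escape. Encoding such a string as ASCII fails, while encoding as UTF-8 works. The substitution stays safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it cannot be mistaken for a backslash or an octal digit.

```python
import re

def decode_octal_escapes(text):
    # Hypothetical wrapper around the one-liner above.
    return re.sub(
        rb'\\([0-7]{3})',
        lambda match: bytes([int(match[1], 8)]),
        text.encode('utf-8'),
    ).decode('utf-8')

# Hypothetical input: a real "ñ" (U+00F1) plus an escaped combining tilde.
old_string = 'Se\u00f1or-' + r'Espan\314\203ol.mp4'

try:
    old_string.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    # The real "ñ" cannot be represented in ASCII.
    ascii_ok = False

new_string = decode_octal_escapes(old_string)
print(ascii_ok)
print(new_string)
```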
