How to convert escaped sequences to literal characters

Question

I have to process RTF files in which Cyrillic characters have been converted to escape sequences:

{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}

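I want to convert the Cyrillic characters back but leave the RTF tags unaltered. Is there a Pythonic way to do this without third-party applications (such as OpenOffice)?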

Answer 1

We can first list the hex codes with a regular expression, then build a bytes object from those values and decode it. Your data appears to be encoded with 'cp1251'.

import re

data = r"pg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"

hex_codes = re.findall(r"(?<=')[0-9A-F]{2}", data)
encoded = bytes(int(hcode, 16) for hcode in hex_codes)
# or, as rightly suggested by @Henry Tjhia:
# encoded = bytes.fromhex(''.join(hex_codes))
text = encoded.decode('cp1251')
print(text)
# Список документов

Answer 2

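Although @Thierry Lathuille's answer does not solve the original problem (I need the RTF tags left unchanged), it solves the hardest part. So here is a solution to the original question: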

import re

# A raw string is required: otherwise \r, \f, \a, \t and \' are interpreted by Python
string = r"{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
hex_codes = re.findall(r"(?<=')[0-9A-F]{2}", string)
d = {"\\'" + code: bytes.fromhex(code).decode("cp1251") for code in hex_codes}
for byte, char in d.items():
    string = string.replace(byte, char)
print(string)
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}
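
For reference, the same idea can probably be done in a single pass with re.sub, decoding each run of consecutive \'hh escapes together. This is only a sketch: the helper name decode_rtf_escapes and its encoding parameter are made up for illustration, the code page is assumed to be cp1251 (as \ansicpg1251 suggests) rather than read from the file, and an escaped backslash followed by an apostrophe in the RTF would be misread as an escape.

```python
import re

# One or more consecutive \'hh escapes; accepts upper- and lowercase hex digits.
ESCAPE_RUN = re.compile(r"(?:\\'[0-9A-Fa-f]{2})+")


def decode_rtf_escapes(rtf, encoding="cp1251"):
    """Decode runs of \\'hh escapes; everything else is left untouched.

    Hypothetical helper: the encoding is an assumption passed in by the caller.
    """
    def _decode(match):
        # Strip the \' prefixes, leaving only the hex digits of the run.
        hex_digits = match.group(0).replace("\\'", "")
        return bytes.fromhex(hex_digits).decode(encoding)

    return ESCAPE_RUN.sub(_decode, rtf)


sample = r"{\rtf1\fbidis\ansicpg1251{\info{\title \'D1\'EF\'E8\'F1\'EE\'EA\'20\'E4\'EE\'EA\'F3\'EC\'E5\'ED\'F2\'EE\'E2}"
print(decode_rtf_escapes(sample))
# {\rtf1\fbidis\ansicpg1251{\info{\title Список документов}
```

Decoding whole runs at once, instead of replacing one escape at a time, should also keep multi-byte code pages intact, since the bytes of one character are never decoded separately.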

python

rtf