How to decode b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"?
[Summary]:
The data grabbed from the file is
b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
How can these bytes be decoded into readable Chinese characters?
======
I extracted some game scripts from an exe file. The file was packed with Enigma Virtual Box, and I unpacked it.
After that, I could see the English names of the scripts correctly.
While parsing these scripts, I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte
I switched to GBK decoding and the error went away,
but the output file is unreadable. It contains readable English characters mixed with unreadable content that should be Chinese. Example:
chT0002>pDIӘIʆ
I tried saving the file with different encodings, and they all display the same result, so the problem is probably in the decoding step.
The data grabbed from the file is
b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
I have tried many approaches, but I just cannot decode these bytes into readable Chinese characters. Is there something wrong with the file itself, or somewhere else? I really need help.
One of the scripts is attached here.
To decode bytes reliably, you have to know how they were encoded. I will borrow a quote from the Python codecs documentation:
Without external information it’s impossible to reliably determine which encoding was used for encoding a string.
Without that information, there are libraries that attempt to detect the encoding (chardet seems to be the most widely used). Here is one approach you could take.
import chardet

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
detected = chardet.detect(data)
# detect() returns a dict such as {"encoding": "utf-8", "confidence": 0.87, ...};
# "encoding" is None when detection fails, so guard before decoding.
if detected["encoding"] is not None:
    decoded = data.decode(detected["encoding"])
However, the example above does not work in this case, because chardet cannot detect the encoding of these bytes. At that point, you will have to fall back to trial and error or try other libraries.
One approach you can take is simply to try every standard encoding, print out the results, and see which one makes sense.
codecs = [
"ascii", "big5", "big5hkscs", "cp037", "cp273", "cp424", "cp437", "cp500", "cp720",
"cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858", "cp860",
"cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875",
"cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1125", "cp1140", "cp1250",
"cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257",
"cp1258", "cp65001", "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312",
"gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2",
"iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1",
"iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6", "iso8859_7",
"iso8859_8", "iso8859_9", "iso8859_10", "iso8859_11", "iso8859_13", "iso8859_14",
"iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_t", "koi8_u", "kz1048",
"mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman",
"mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", "shift_jisx0213",
"utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7",
"utf_8", "utf_8_sig",
]
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
for codec in codecs:
    try:
        print(f"{codec}, {data.decode(codec)}")
    except UnicodeDecodeError:
        continue
Output
cp037, nC«^ýËfimb«[
cp273, nC«¢ýËfimb«¬
cp437, ò├è░ìsåëöéè║
cp500, nC«¢ýËfimb«¬
cp720, ـ├è░së¤éè║
cp737, Χ├Λ░ΞsΗΚΦΓΛ║
cp775, Ģ├Ŗ░ŹsåēöéŖ║
cp850, ò├è░ìsåëöéè║
cp852, Ľ├Ő░ŹsćëöéŐ║
cp855, Ћ├і░ЇsєЅћѓі║
cp856, ץ├ך░םsזיפגך║
cp857, ò├è░ısåëöéè║
cp858, ò├è░ìsåëöéè║
cp860, ò├è░ìsÁÊõéè║
cp861, þ├è░Þsåëöéè║
cp862, ץ├ך░םsזיפגך║
cp863, Ï├è░‗s¶ëËéè║
cp864, ¼ﺃ├٠┌s│┬½∙├ﻑ
cp865, ò├è░ìsåëöéè║
cp866, Х├К░НsЖЙФВК║
cp875, nCα£δΉfimbας
cp949, 빩뒺뛱냹봻듆
cp1006, ﺣﺍsﭦ
cp1026, nC«¢`Ëfimb«¬
cp1125, Х├К░НsЖЙФВК║
cp1140, nC«^ýËfimb«[
cp1250, •ĂŠ°Ťs†‰”‚Šş
cp1251, •ГЉ°Ќs†‰”‚Љє
cp1256, •أٹ°چs†‰”‚ٹ؛
gbk, 暶姲峴唹攤姾
gb18030, 暶姲峴唹攤姾
latin_1, ðsº
iso8859_2, ðsş
iso8859_4, ðsē
iso8859_5, УАsК
iso8859_7, Γ°sΊ
iso8859_9, ðsº
iso8859_10, ðsš
iso8859_11, รฐsบ
iso8859_13, ưsŗ
iso8859_14, ÃḞsẃ
iso8859_15, ðsº
iso8859_16, ðsș
koi8_r, ∙ц┼╟█s├┴■┌┼╨
koi8_u, ∙ц┼╟█s├┴■┌┼╨
kz1048, •ГЉ°Қs†‰”‚Љғ
mac_cyrillic, Х√К∞НsЖЙФВКЇ
mac_greek, ïΟäΑçsÜâî²äΚ
mac_iceland, ï√ä∞çsÜâîÇä∫
mac_latin2, ē√äįćsÜČĒāäļ
mac_roman, ï√ä∞çsÜâîÇä∫
mac_turkish, ï√ä∞çsÜâîÇä∫
ptcp154, •ГҠ°ҚsҶү”ӮҠә
shift_jis_2004, 陛寛行̹狽桓
shift_jisx0213, 陛寛行̹狽桓
utf_16, 쎕낊玍覆芔몊
utf_16_be, 闃誰赳蚉钂誺
utf_16_le, 쎕낊玍覆芔몊
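Rather than eyeballing every line of output above, you can narrow the field programmatically. Below is a minimal sketch of that idea; the codec shortlist and the `cjk_ratio` heuristic (keep only results that consist entirely of CJK ideographs) are my own additions for illustration:

```python
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

def cjk_ratio(text: str) -> float:
    """Fraction of characters in the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    return sum("\u4e00" <= ch <= "\u9fff" for ch in text) / len(text)

# Keep only codecs whose decoded output is 100% CJK ideographs --
# mixed scripts or stray symbols usually signal a wrong guess.
candidates = []
for codec in ("gbk", "gb18030", "big5", "shift_jis_2004", "utf_16_be", "utf_16_le"):
    try:
        decoded = data.decode(codec)
    except UnicodeDecodeError:
        continue
    if cjk_ratio(decoded) == 1.0:
        candidates.append((codec, decoded))

for codec, decoded in candidates:
    print(f"{codec}: {decoded}")
```

For these bytes the filter leaves gbk, gb18030, and utf_16_be: big5 fails to decode, while shift_jis_2004 and utf_16_le produce stray combining marks or Hangul and are discarded.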
Edit: after running every seemingly clean result through Google Translate, I suspect the encoding is UTF-16 big-endian. The results are as follows:
+-----------+---------------+--------------------+--------------------------+
| Encoding | Decoded | Language Detected | English Translation |
+-----------+---------------+--------------------+--------------------------+
| gbk | 暶姲峴唹攤姾 | Chinese | Jian Xian JiaoTanJiao |
| gb18030 | 暶姲峴唹攤姾 | Chinese | Jian Xian Jiao Tan Jiao |
| utf_16 | 쎕낊玍覆芔몊 | Korean | None |
| utf_16_be | 闃誰赳蚉钂誺 | Chinese | Who is the epiphysis? |
| utf_16_le | 쎕낊玍覆芔몊 | Korean | None |
+-----------+---------------+--------------------+--------------------------+
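One quick sanity check on the UTF-16 big-endian hypothesis: for these code points the codec maps each 2-byte unit to one character, so the decoded text must round-trip back to the original bytes. This only shows the decoding is self-consistent, not that the file really used this encoding:

```python
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

decoded = data.decode("utf_16_be")
print(decoded)  # 闃誰赳蚉钂誺

# Re-encoding must reproduce the input byte-for-byte,
# and 12 bytes should yield exactly 6 characters.
assert decoded.encode("utf_16_be") == data
assert len(decoded) == len(data) // 2
```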