Read a str from a file that contains hex byte escape sequences and decode it?
I have a file, example.log, which contains:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"/>
I want to read the file, convert the str to UTF-8, and write it to a new file. My current code is:
import re

with open("example_decoded.log", 'w') as f:
    for line in open("example.log", 'r', encoding='utf-8'):
        m = re.search("<POOR_IN200901UV", line)
        if m:
            line = line[m.start():-2]
            line_bytes = bytes(line, encoding='raw_unicode_escape')
            line_decoded = line_bytes.decode('utf-8')
            print(line_decoded)
            f.write(line_decoded)
        else:
            pass
But example_decoded.log contains:
<POOR_IN200901UV xmlns="urn:hl7-org:v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ITSVersion="XML_1.0"
xsi:schemaLocation="urn:hl7-org:v3
../../Schemas/POOR_IN200901UV20.xsd">\n\t<!-- \xe6\xb6\x88\xe6\x81\xafID -
->\n\t<id extension="BS002"
The \xe6\xb6\x88\xe6\x81\xaf part is not decoded, so I would like to know how to handle this mixed-type str decoding problem?
decodedVal = struct.unpack(">f", bytes.fromhex(encoded_val))[0]
See the link below; substitute your own byte order and type for ">f".
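To make the struct suggestion concrete, here is a minimal sketch. The helper name hex_to_float and the sample hex strings are illustrative assumptions, not part of the original answer. Note that bytes.fromhex expects plain hex digit pairs such as "3f800000", so this approach fits hex-encoded binary values rather than text containing \xNN escapes.

```python
import struct


def hex_to_float(hex_text: str, fmt: str = ">f") -> float:
    """Unpack one float from a string of hex digits (hypothetical helper).

    fmt is a struct format string: ">" means big-endian, "<" little-endian,
    "f" a 4-byte float, "d" an 8-byte double.
    """
    return struct.unpack(fmt, bytes.fromhex(hex_text))[0]


# The same value, 1.0, stored with different byte orders and widths.
big_endian = hex_to_float("3f800000", ">f")
little_endian = hex_to_float("0000803f", "<f")
double_value = hex_to_float("3ff0000000000000", ">d")
```

The number of hex digits has to match the size implied by the format (8 digits for "f", 16 for "d"); otherwise struct.unpack raises struct.error.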
import codecs
decode_hex = codecs.getdecoder("hex_codec")
string = decode_hex(string)[0]
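As a quick illustration of what the hex_codec decoder returns, here is a small sketch. The input is an assumption: it is the hex-digit form of the six bytes from the log sample, written without the \x prefixes, which is the form this codec expects.

```python
import codecs

# getdecoder returns a function that yields a (decoded_bytes, length_consumed) tuple.
decode_hex = codecs.getdecoder("hex_codec")

# Assumed input: plain hex digits, not "\xe6"-style escapes.
raw_bytes = decode_hex(b"e6b688e681af")[0]

# The resulting bytes are UTF-8 and can be decoded to text in a second step.
text = raw_bytes.decode("utf-8")
```

Because the log line mixes ordinary text with \xNN escapes, this codec cannot be applied to the whole line as-is; the escapes would have to be extracted first.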
See: Read hex characters and convert them to utf-8 using python 3
The solution is:
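import re

# Write the output as UTF-8 explicitly so the decoded text does not depend on the platform default.
with open("example_decoded.log", 'w', encoding='utf-8') as f:
    for line in open("example.log", 'r', encoding='utf-8'):
        m = re.search("<POOR_IN200901UV", line)
        if m:
            line = line[m.start():-2]
            # Interpret the literal escapes, turn the result back into raw bytes, then decode as UTF-8.
            line_decoded = bytes(line, 'utf-8').decode('unicode_escape').encode('latin-1').decode('utf8')
            print(line_decoded)
            f.write(line_decoded)
        else:
            pass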
I still don't understand why encode('latin-1') is needed first, though. Can someone explain?
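One way to see what the latin-1 step does is to split the chain into its three stages. The sketch below is only an illustration: the function name decode_escaped_utf8 and the shortened sample string are assumptions, not part of the original post. The key point is that unicode_escape turns the four characters \xe6 into the single code point U+00E6 rather than the byte 0xE6, and Latin-1 is the encoding that maps code points U+0000 to U+00FF one-to-one onto bytes 0x00 to 0xFF.

```python
def decode_escaped_utf8(text: str) -> str:
    """Decode literal \\xNN escapes that spell out UTF-8 bytes (hypothetical helper)."""
    # Stage 1: interpret the backslash escapes. Each \xNN becomes the code
    # point U+00NN, so the result is still a str, not bytes.
    code_points = text.encode("utf-8").decode("unicode_escape")
    # Stage 2: Latin-1 maps U+0000..U+00FF straight onto bytes 0x00..0xFF,
    # which recovers the raw byte values the escapes stood for.
    raw_bytes = code_points.encode("latin-1")
    # Stage 3: those bytes are UTF-8, so decode them as such.
    return raw_bytes.decode("utf-8")


# Raw string: the backslashes are literal characters, as they are in the log file.
sample = r"<!-- \xe6\xb6\x88\xe6\x81\xafID -->\n\t<id/>"

after_stage_1 = sample.encode("utf-8").decode("unicode_escape")
decoded = decode_escaped_utf8(sample)
```

After stage 1 the text contains the character U+00E6 where the first escape was, which is why a direct UTF-8 decode is not possible at that point. Two caveats: the literal \n and \t in the input also become a real newline and tab, and a \uXXXX escape above U+00FF would make the latin-1 step raise UnicodeEncodeError.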