Decoding string in python3

Question

How do I convert

str1 = 'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'

to

final_str = 'Sabra® Family Size Roasted Pine Nut Hummus - 17 oz'

in Python 3?

I tried:

str1.encode('utf-8') html.unescape
str1.encode('utf-8').decode('unicode_escape')
str1.encode('utf-8').decode('ascii')

but had no luck.

isinstance(str1, str) returns True, and str1.encode('utf-8') returns the byte string b'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'.

I also imported charade, but my encode/decode calls fail with:

AttributeError: 'str' object has no attribute 'decode'  
AttributeError: 'str' object has no attribute 'encoding'

Answer 1

You are looking for \u; put it in front of the code point and the correct Unicode character will be rendered.

>>> str1 = 'Sabrau\u00AE Family Size Roasted Pine Nut Hummus - 17 oz'
>>> str1
'Sabrau® Family Size Roasted Pine Nut Hummus - 17 oz'
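Note that \u is only interpreted when it appears in a string literal in source code. The following is a minimal sketch of that difference, assuming (as in the question) that the text arrives at runtime without the backslash; the variable names are only illustrative.

```python
# A '\u00AE' escape in a literal is resolved by the parser into one character.
literal = 'Sabra\u00AE'
# Text that arrives at runtime as plain 'u00AE' is just five ordinary characters.
runtime = 'Sabrau00AE'

print(len(literal))        # 6
print(len(runtime))        # 10
print(literal == runtime)  # False
```

So editing the literal helps only when you control the source code; data read from a file or an API still needs a conversion step.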

Answer 2

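Your string does not use any standard encoding, so it may be ambiguous. Assuming that a "u" followed by four hexadecimal digits always means "insert this Unicode code point", the code below works. Be aware that any u followed by four hex digits will be converted to a Unicode character: for example, "Plateau1000 Protein Powder" becomes "Plateaက Protein Powder":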

import re

# 1. locate u followed by 4 hexdigits
# 2. capture digits and convert to an integer using base 16
# 3. convert integer to a Unicode char
# 4. use character as the substitution for the digits
def convert(s):
    return re.sub(r'u([0-9A-F]{4})',lambda m: chr(int(m.group(1),16)), s)

str1 = 'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'
str2 = convert(str1)
print(str2)

Output:

Sabra® Family Size Roasted Pine Nut Hummus - 17 oz
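If false positives such as "Plateau1000" are a concern, one possible mitigation is to convert only code points you actually expect. The sketch below is an assumption rather than part of the answer above: the name convert_allowed and the ALLOWED whitelist are hypothetical, and you would choose the set of symbols for your own data.

```python
import re

# Hypothetical whitelist: only these code points are treated as escaped symbols.
ALLOWED = {0x00AE, 0x2122, 0x00A9}  # ®, ™, ©

def convert_allowed(s, allowed=ALLOWED):
    def repl(m):
        code = int(m.group(1), 16)
        # Leave the text untouched unless the code point is whitelisted.
        return chr(code) if code in allowed else m.group(0)
    return re.sub(r'u([0-9A-F]{4})', repl, s)

print(convert_allowed('Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'))
print(convert_allowed('Plateau1000 Protein Powder'))
```

With this whitelist the first string gains the ® sign and the second is returned unchanged, at the cost of missing any symbol that is not listed.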

Answer 3

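Thanks to @Mark Tolonen for the help with the regex. With your output I was still getting the 'u' in the name along with the decoded symbol, so I fixed that edge case with the code below. It:

Finds substrings made of a 'u' followed by 4 digits/characters.
Uses a replacement function to turn each such substring into a Unicode escape sequence.
Decodes the result with unicode-escape.

The following code works: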

import re

def convert(s):
    # '\\u' must be escaped: a bare '\u' string literal is a SyntaxError in Python 3.
    escaped = re.sub(r'u[0-9A-F]{4}', lambda m: m.group().replace('u', '\\u'), s)
    return escaped.encode('utf-8').decode('unicode-escape')

Input:

 str1 = 'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'

Code:

str2=convert(str1)
print (str2)
print(type(str2))

Output:

Sabra® Family Size Roasted Pine Nut Hummus - 17 oz
<class 'str'>
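One caveat with decoding through unicode-escape: that codec reads the non-escape bytes as Latin-1, so a string that already contains non-ASCII characters can come out garbled. The sketch below compares the two approaches on a made-up sample string; convert_escape and convert_chr are illustrative names, not functions from the answers.

```python
import re

def convert_escape(s):
    # Build '\uXXXX' escapes, then decode them via a bytes round trip.
    escaped = re.sub(r'u([0-9A-F]{4})', lambda m: '\\u' + m.group(1), s)
    return escaped.encode('utf-8').decode('unicode-escape')

def convert_chr(s):
    # Build the character directly from the code point, with no byte round trip.
    return re.sub(r'u([0-9A-F]{4})', lambda m: chr(int(m.group(1), 16)), s)

sample = 'Café Sabrau00AE'
print(convert_escape(sample))  # the existing 'é' is mangled
print(convert_chr(sample))     # Café Sabra®
```

For pure ASCII input such as the string in the question both functions return the same result, so the difference only matters for mixed data.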
