urllib.unquote_plus gives a different output on the same string

Question

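I'm a bit of a Python newbie and can't understand what is happening here. I'm decoding a URL-encoded string. I have a file named " dump®.txt" (the leading space is intentional). Depending on the type of my object, I get two different results: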

>>> import urllib
>>> string1 = u'+dump%C2%AE.txt'
>>> print urllib.unquote_plus(string1)
 dumpÂ®.txt

>>> string2 = '+dump%C2%AE.txt'
>>> print urllib.unquote_plus(string2)
 dump®.txt

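I expected both string1 and string2 to show the ® character (or possibly even the opposite behavior). Can anyone help me understand why is it that string1 needs to be a string type before I get my desired dump®.txt?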

Answer 1

Can anyone help me understand why is it that string1 needs to be a string type before I get my desired dump®.txt?

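urllib doesn't support unicode because, by definition, a URL can contain only ASCII characters. If you pass a unicode object, some crappy code in Python 2 tries to do the right thing, but it's buggy: it turns each %XX escape into a separate character instead of a byte, so the UTF-8 sequence C2 AE comes out as the two characters Â® rather than the single character ®.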
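The effect can be reproduced by hand. The sketch below uses Python 3's urllib.parse rather than the Python 2 urllib from the question, and it is not the actual Python 2 implementation; it only assumes that the unicode code path behaves as if the escaped bytes were read as Latin-1, which matches the output shown in the question. The '+' is left out to keep the focus on %C2%AE.

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding really produces bytes: 0xC2 0xAE is the UTF-8
# encoding of the single character U+00AE (registered sign).
raw = unquote_to_bytes('dump%C2%AE.txt')

# Correct interpretation: the two bytes form one character.
as_utf8 = raw.decode('utf-8')

# Buggy interpretation: each byte becomes its own character,
# U+00C2 followed by U+00AE, which displays as the mojibake.
as_latin1 = raw.decode('latin-1')

print(ascii(as_utf8))
print(ascii(as_latin1))
```

The second string is one character longer than the first, which is the extra Â seen in the question.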

Passing a unicode object to urllib.unquote is a user error; don't do that. This is correct:

print urllib.unquote_plus(string1.encode('ascii'))
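If the decoded name is needed as text rather than as a byte string, the same idea can be wrapped in a small helper. This is only a sketch: decode_url_component is a hypothetical name, not part of urllib, and it assumes the escaped bytes are UTF-8, as they are in the question. It picks the right import for Python 2 or Python 3 at run time.

```python
import sys

if sys.version_info[0] == 2:
    from urllib import unquote_plus as _unquote_plus

    def decode_url_component(text):
        """Hypothetical helper: percent-decode text and return unicode."""
        if isinstance(text, unicode):  # this name exists only on Python 2
            # A URL-encoded string is plain ASCII, so this is lossless;
            # it raises UnicodeEncodeError if the input is not ASCII.
            text = text.encode('ascii')
        # On a byte string, unquote_plus returns bytes; assume UTF-8.
        return _unquote_plus(text).decode('utf-8')
else:
    from urllib.parse import unquote_plus as _unquote_plus

    def decode_url_component(text):
        """Hypothetical helper: percent-decode text and return str."""
        # Python 3 decodes the escaped bytes itself (UTF-8 by default).
        return _unquote_plus(text, encoding='utf-8')


name = decode_url_component('+dump%C2%AE.txt')
print(name == u' dump\xae.txt')
```

Either branch gives the same text for the string from the question, with the leading '+' turned into a space.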
