Python Requests Strange URL %-Encoding

Question

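Using Python 3.6.1 and Requests 2.13.0, I'm getting strange encoding of the URL being requested. I have a URL with Chinese characters in the query string, e.g. huà 話 用, which should %-encode to hu%C3%A0%20%E8%A9%B1%20%E7%94%A8 or even hu%C3%A0+%E8%A9%B1+%E7%94%A8, but for some reason it is %-encoded as hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8. This is incorrect. I've been using the http://r12a.github.io/apps/conversion/ page to help me with the encodings. I've also tried JavaScript's encodeURI and PHP's urlencode, and neither produces anything close to what the Requests library is doing.

Am I doing something wrong that throws the encoding this far off?

UPDATE: I read up on Mojibake and dug into the Requests library, and I found the problem, but I'm still not sure how to fix it.

I'm calling an internal server with a simple .get(url) call. The call goes to the server and gets a redirect response. The redirect page has a meta charset="UTF-8" in its head, and the URL listed in it is correct. The Location header leaving the server is ok; it is encoded as UTF-8, and there is a charset=UTF-8 on the Content-Type header. However, when I debug the redirect response in Python, I notice that the header in the response object is incorrect; it appears not to have been decoded properly. The headers property contains this for location: huÃ\xa0 è©± ç\x94. As stated above, it should decode to: huà 話 用. So that strange URL query string is %-encoded by Requests and sent back to the server, which then rejects the URL (obviously).

Is there anything I can do to stop Requests from messing things up? Or to make it decode the Location header correctly? Web browsers don't seem to have this problem.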

Answer 1

You have a Mojibake encoding. The encoded bytes are the Latin-1 interpretation of UTF-8 bytes:

>>> from urllib.parse import quote
>>> text = 'huà 話 用'
>>> quote(text)
'hu%C3%A0%20%E8%A9%B1%20%E7%94%A8'
>>> quote(text.encode('utf8').decode('latin1'))
'hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8'

You can reverse the process by manually encoding to Latin-1 again, then decoding from UTF-8:

>>> from urllib.parse import unquote
>>> unquote('hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8').encode('latin1').decode('utf8')
'huà 話 用'

Or you could use the ftfy library to automate fixing the wrong encoding (ftfy usually does a much better job, especially when Windows codepages are involved in the Mojibake):

>>> from ftfy import fix_text
>>> fix_text(unquote('hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8'))
'huà 話 用'

You say this about the source of the URL:

The location header leaving the server is ok; it is encoded as UTF-8

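That's your problem, right there. HTTP headers are always encoded as Latin-1^(*). The server must set the Location header to a fully percent-encoded URL, so that all UTF-8 bytes are represented as %HH escape sequences. Those are just ASCII characters, which are preserved perfectly in a Latin-1 context.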
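As a rough illustration of what the server side should do before setting the header, here is a minimal sketch using only the standard library. The function name percent_encode_location and the example.com URL are made up for this example, and it assumes the host name is already ASCII (internationalized host names need IDNA handling, which is not covered here):

```python
from urllib.parse import quote

# Reserved URL delimiters are left alone; "%" is kept so that escapes
# already present in the URL are not encoded a second time.
_SAFE = ":/?#[]@!$&'()*+,;=%"


def percent_encode_location(url):
    """Hypothetical helper: return `url` with spaces and non-ASCII
    characters replaced by %HH escapes of their UTF-8 bytes."""
    return quote(url, safe=_SAFE, encoding="utf-8")


location = percent_encode_location("http://example.com/search?q=huà 話 用")
# The result is pure ASCII, so Latin-1 decoding on the client is harmless.
location.encode("ascii")
print(location)
```

The printed value is the URL with the query string escaped as hu%C3%A0%20%E8%A9%B1%20%E7%94%A8, which is the form the question expected to see.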

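If your server sends the header as un-escaped UTF-8 bytes, then HTTP clients (including requests) will decode it as Latin-1 instead, producing exactly the Mojibake you observed. And because the URL then contains invalid URL characters, requests escapes the Mojibake result into the percent-encoded version.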

^(*) Actually, the Location header should be an absoluteURI as per RFC2396, which is always ASCII (7-bit) clean data, but because some other HTTP headers allow for 'descriptive' text, Latin-1 (ISO-8859-1) is the accepted default encoding for header data. See the TEXT rule in section 2.2 of the HTTP/1.1 RFC; the http.client module that ultimately decodes the headers for requests follows this RFC when decoding non-ASCII data in any header. You can provide non-Latin-1 data only if it is wrapped as per the Message Header Extensions RFC 2047, but that doesn't apply to the Location header.
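If you can't fix the server, one possible client-side workaround is to follow the redirects yourself and repair the Location value before reusing it. The sketch below is only an illustration under some assumptions: repair_location and get_with_repaired_redirects are hypothetical names, session is expected to be a requests.Session (only its get(url, allow_redirects=False) method and the is_redirect, headers and url attributes of the response are used), and every hop is re-issued as a plain GET:

```python
from urllib.parse import urljoin


def repair_location(value):
    """Undo a Latin-1 decode of a header value that was really UTF-8.

    Values that are not this kind of Mojibake are returned unchanged.
    """
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


def get_with_repaired_redirects(session, url, max_redirects=10):
    """Follow redirects by hand, repairing each Location header first."""
    for _ in range(max_redirects):
        response = session.get(url, allow_redirects=False)
        if not response.is_redirect:
            return response
        location = repair_location(response.headers["location"])
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(response.url, location)
    raise RuntimeError("too many redirects")
```

With a real session this would be called as get_with_repaired_redirects(requests.Session(), url). Because the repaired URL contains the intended characters again, it is in the same shape as the URL from the question that should %-encode correctly. Treat this as a stopgap; emitting a percent-encoded Location header from the server remains the proper fix.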


python

url-encoding

mojibake

python-3.x

python-requests