urllib2.quote does not work properly

Question

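I'm trying to fetch the HTML of a page whose URL contains diacritics (í, č...). The problem is that urllib2.quote doesn't seem to work the way I expected.

As I understand it, quote should convert a URL containing diacritics into a valid URL.

Here is an example: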

url = 'http://www.example.com/vydavatelství/'

print urllib2.quote(url)

>> http%3A//www.example.com/vydavatelstv%C3%AD/

The problem is that, for some reason, it changes the http:// part of the string. urllib2.urlopen(req) then returns an error:

response = urllib2.urlopen(req)
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
  return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 437, in open
  response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 550, in http_response
  'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 475, in error
  return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
  result = func(*args)
File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
  raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request

Answer 1

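-- TL;DR --

Two things. First, make sure you include the encoding declaration # -*- coding: utf-8 -*- at the top of your Python script. This tells Python which encoding the text in the file uses. Second, you need to specify the safe characters, that is, the characters the quote method will leave unconverted. By default only / is specified as a safe character. That means the : is being converted, which breaks your URL.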

url = 'http://www.example.com/vydavatelství/'
urllib2.quote(url,':/')
>>> http://www.example.com/vydavatelstv%C3%AD/
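
An alternative to whitelisting ':' for the whole string is to split the URL and quote only the path, so the scheme and host are never touched. The following is a minimal sketch, not the asker's code: it assumes Python 3, where quote lives in urllib.parse (on Python 2.7 the same functions are in urllib and urlparse), and quote_url_path is a hypothetical helper name.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Hypothetical helper: percent-encode only the path component of url."""
    parts = urlsplit(url)
    # quote() leaves '/' alone by default, so the path separators survive
    # while non-ASCII characters are UTF-8 encoded and percent-escaped.
    encoded_path = quote(parts.path)
    # Scheme, host, query and fragment are passed through unchanged.
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


print(quote_url_path('http://www.example.com/vydavatelství/'))
# http://www.example.com/vydavatelstv%C3%AD/
```

One caveat: if the path is already percent-encoded, quoting it again turns each % into %25, so this should only be applied to raw, unencoded URLs.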

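-- A bit more detail --

The first problem here is that the urllib2 documentation is poor. Going by the link Kamal provided, I see no mention of the quote method in the docs. That makes troubleshooting very difficult.

That said, let me explain a little.

urllib2.quote appears to be the same implementation as urllib's quote, which is documented pretty well. The Python 3 version, urllib.parse.quote, takes four parameters (in Python 2.7, quote accepts only string and safe):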

urllib.parse.quote(string, safe='/', encoding=None, errors=None)
##   string: the string you're trying to encode
##     safe: string containing the characters to leave unescaped. Default is '/'
## encoding: encoding used to turn non-ASCII characters into bytes. Default is utf-8
##   errors: specifies how encoding errors are handled. Default is 'strict', which raises a UnicodeEncodeError.
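
To make those four parameters concrete, here is a small sketch of how each one changes the result. It assumes Python 3's urllib.parse.quote (the encoding and errors arguments are not available in Python 2.7), and the variable names are only for illustration.

```python
from urllib.parse import quote

url = 'http://www.example.com/vydavatelství/'

# Default safe='/': the colon is escaped, which breaks the scheme.
default_quoted = quote(url)
print(default_quoted)   # http%3A//www.example.com/vydavatelstv%C3%AD/

# Adding ':' to safe keeps 'http://' intact.
with_safe = quote(url, safe=':/')
print(with_safe)        # http://www.example.com/vydavatelstv%C3%AD/

# encoding controls which bytes are escaped; in Latin-1, 'í' is one byte.
latin1_quoted = quote(url, safe=':/', encoding='latin-1')
print(latin1_quoted)    # http://www.example.com/vydavatelstv%ED/

# errors='strict' (the default) raises when a character cannot be encoded.
try:
    quote('č', encoding='ascii')
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
print(strict_raised)    # True

# errors='replace' substitutes '?', which is then escaped as %3F.
replaced = quote('č', encoding='ascii', errors='replace')
print(replaced)         # %3F
```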
