Bad URLs in Python 3.4.3

Question

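I'm new to this, so please bear with me. I am using urllib.request to open and read web pages. Can someone tell me how my code can handle redirects, timeouts, and malformed URLs? I have found a way to handle timeouts, but I am not sure it is correct. Is it? All opinions are welcome! Here it is: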

import logging
from socket import timeout
import urllib.request
from urllib.error import HTTPError, URLError

# url and name are set earlier in my program
try:
    text = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error('Data of %s not retrieved because %s\nURL: %s', name, error, url)
except timeout:
    logging.error('socket timed out - URL %s', url)

Please help, as I am new to this. Thanks!

Answer 1

Take a look at the urllib.error documentation page.

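So, for the behaviours you ask about:

Redirects: urlopen follows redirects automatically through HTTPRedirectHandler. A redirect that cannot be followed (a redirect loop, for example) surfaces as an HTTPError carrying the status code, such as 302.
Timeouts: you have that right. Note that a timeout while connecting may arrive wrapped in a URLError rather than as socket.timeout.
Malformed URLs: these raise URLError. A URL with no scheme at all raises ValueError instead.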
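
To make the three cases above concrete, here is a minimal sketch that maps each failure to a short label. The names classify_error and fetch, and the label strings, are my own and not part of urllib; treat this as one possible arrangement under those assumptions rather than the definitive way to do it.

```python
import socket
import urllib.error
import urllib.request


def classify_error(error):
    """Map an exception from urlopen()/read() to a short label (hypothetical helper)."""
    if isinstance(error, urllib.error.HTTPError):
        # HTTPError is a subclass of URLError, so it has to be tested first.
        return "http %d" % error.code
    if isinstance(error, urllib.error.URLError):
        # A timeout while connecting arrives wrapped in URLError.reason.
        if isinstance(error.reason, socket.timeout):
            return "timeout"
        return "url error"
    if isinstance(error, socket.timeout):
        # A timeout while waiting for or reading the response is raised directly.
        return "timeout"
    if isinstance(error, ValueError):
        # urlopen() rejects a URL with no scheme before touching the network.
        return "malformed url"
    return "unknown"


def fetch(url, timeout=10):
    """Return (text, None) on success or (None, label) on a known failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
    except (urllib.error.URLError, socket.timeout, ValueError) as error:
        return None, classify_error(error)
    # Decode outside the try block so that a decoding problem is not
    # mistaken for a malformed URL (UnicodeDecodeError is a ValueError).
    return raw.decode("utf-8", errors="replace"), None
```

Calling fetch("no-scheme-here") returns (None, "malformed url") without any network access, which is a quick way to check the ValueError branch.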

Here is the code I would use:

from socket import timeout
import urllib.error
import urllib.request

try:
    text = urllib.request.urlopen("http://www.google.com", timeout=0.1).read()
except urllib.error.HTTPError as error:
    # Must come before URLError, because HTTPError is a subclass of it
    print(error)
except urllib.error.URLError as error:
    print(error)
except timeout as error:
    print(error)

I could not find a URL that redirects to test with, so I am not sure how to check whether an HTTPError is a redirect.
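
One way to check for a redirect, sketched under a few assumptions: urllib normally follows redirects for you, so to see the 3xx as an HTTPError you can install a handler that declines to follow it. NoRedirectHandler, redirect_target and fetch_without_redirects are hypothetical names of mine; HTTPRedirectHandler, redirect_request and build_opener are the standard urllib.request pieces.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Decline to follow redirects so that a 3xx reaches the caller as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built; urllib then
        # falls through to its default error handler, which raises HTTPError.
        return None


def redirect_target(error):
    """Return the Location header of a 3xx HTTPError, or None if it is not a redirect."""
    if 300 <= error.code < 400:
        return error.headers.get("Location")
    return None


def fetch_without_redirects(url, timeout=10):
    """Return (body, None), or (None, target) when the server answered with a redirect."""
    opener = urllib.request.build_opener(NoRedirectHandler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.HTTPError as error:
        target = redirect_target(error)
        if target is None:
            raise  # a genuine HTTP error such as 404 or 500
        return None, target
```

If you keep the default opener instead, comparing response.geturl() with the URL you requested tells you whether a redirect was followed.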

You may find the requests package easier to use (the urllib documentation page recommends it).

Answer 2

I found a better solution using the requests package. The only exceptions you need to handle are:

import requests

# url is set earlier in the program
try:
    r = requests.get(url, timeout=5)
except requests.exceptions.Timeout:
    # Maybe set up for a retry, or continue in a retry loop
    print('request timed out')
except requests.exceptions.TooManyRedirects:
    # Tell the user their URL was bad and try a different one
    print('too many redirects')
except requests.exceptions.ConnectionError:
    # Connection could not be completed
    print('connection failed')
except requests.exceptions.RequestException as e:
    # Catastrophic error. Bail.
    raise SystemExit(e)

To get the text of the page, all you need is: r.text

