urlopen not returning None object when url is mistyped

Question

I am currently working through Ryan Mitchell's Web Scraping with Python. In the first chapter, when he discusses handling errors, he says:

If the server is not found at all (if say, site was down, or the URL was mistyped), urlopen returns a None object.

So, to test this, I wrote the following snippet.

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup as bs

def getTitle(url):

    try:
        html = urlopen(url).read()
    except HTTPError:
        return None

    try:
        bsObj = bs(html)
    except AttributeError:
        return None
    return bsObj

title = getTitle('http://www.wunderlst.com')
print(title)

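On the second-to-last line of this code, I deliberately mistyped the URL (the actual URL is http://www.wunderlist.com). I expected None to be printed to the screen. Instead, I get a long chain of errors. Here is the last part of the error output: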

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ex4.py", line 18, in <module>
    title = getTitle('http://www.wunderlst.com')
  File "ex4.py", line 8, in getTitle
    html = urlopen(url).read()
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 463, in open
    response = self._open(req, data)
  File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
    '_open', req)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.4/urllib/request.py", line 1184, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

Now, if I correct the hostname but append a page that does not exist, for example:

title = getTitle('http://www.wunderlist.com/something')

then None is printed to the screen. I'm really confused. Can anyone explain what is actually happening? Thanks in advance.

Answer 1

The book/article you are referring to is either wrong or outdated. In the urllib documentation you can read:

If the connection cannot be made the IOError exception is raised.

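If the hostname cannot be resolved, the connection obviously cannot be made, so according to the documentation an IOError must be raised. URLError is a subclass of IOError in older Python versions, and the newer urllib does not appear to have a urlopen function, from a quick look.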

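As noted in the comments, I had the wrong library (urllib instead of urllib.request); the urllib.request documentation has a similar line, though:

Raises URLError on errors.

A failed hostname lookup raises a plain URLError, which `except HTTPError` does not catch, so the traceback reaches the top level. An HTTP error such as 404 raises HTTPError, a subclass of URLError, which the code in the question does catch. That is why a wrong path prints None while an unresolvable hostname raises an error.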
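
The relationship between these exception classes can be checked directly. The following is a small sketch that is not part of the original answer; it only inspects the classes in `urllib.error` and makes no network requests.

```python
from urllib.error import HTTPError, URLError

# HTTPError derives from URLError, which in turn derives from OSError.
# (IOError is an alias of OSError in Python 3.3 and later.)
print(issubclass(HTTPError, URLError))  # True
print(issubclass(URLError, OSError))    # True
print(IOError is OSError)               # True

# The reverse does not hold, so "except HTTPError" lets a plain
# URLError propagate, while "except URLError" catches both.
print(issubclass(URLError, HTTPError))  # False
```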

Answer 2

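A URLError is usually raised because there is no network connection (no route to the specified server) or because the specified server does not exist.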

'http://www.wunderlst.com' does not exist, which is why the error is raised.

See the following link for more details:

https://docs.python.org/3.1/howto/urllib2.html#handling-exceptions
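
To illustrate what each kind of failure carries, here is a minimal sketch. The helper name `describe_failure` and the `example.com` URL are assumptions made for illustration, and the exceptions are built by hand so that nothing depends on a network connection.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Return a short description of a urlopen failure (hypothetical helper)."""
    # Check HTTPError first, because it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return "server responded with HTTP status {}".format(exc.code)
    if isinstance(exc, URLError):
        return "could not reach server: {}".format(exc.reason)
    raise TypeError("not a urllib error: {!r}".format(exc))


# Hand-built exceptions standing in for real failures.
dns_failure = URLError("[Errno -2] Name or service not known")
not_found = HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO()
)

print(describe_failure(dns_failure))
print(describe_failure(not_found))
```

A URLError exposes a `reason`, while an HTTPError additionally exposes the numeric status in `code`.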

Answer 3

I think the problem is that you only catch HTTPError (and return None). Try catching the URLError exception as well.

Replace
from urllib.error import HTTPError
with
from urllib.error import HTTPError, URLError

Replace
except HTTPError:
with
except (HTTPError, URLError):

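This gives you the behavior you want (None is returned in both cases). However, I would recommend handling these errors separately (move the first try block into its own function, stop scraping on error, etc.).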
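
Handling the two errors separately could look roughly like the sketch below. It is not the answerer's code: `fetch_html` and its `opener` parameter are made-up names, and `opener` exists only so that the function can be tried without a network connection. Parsing with BeautifulSoup would stay in a separate step, as in the question.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen):
    """Return the page body as bytes, or None if the request fails.

    Hypothetical helper: `opener` defaults to urlopen and can be swapped
    for a stand-in when no network is available.
    """
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        # This clause must come first: HTTPError is a subclass of URLError.
        print("HTTP error {} for {}".format(err.code, url))
    except URLError as err:
        # The server could not be reached at all (e.g. DNS lookup failed).
        print("Could not reach {}: {}".format(url, err.reason))
    return None
```

With this in place, `getTitle` only needs to check whether `fetch_html(url)` returned None before handing the bytes to the parser.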

python

urllib

beautifulsoup

python-3.x