Overriding HTTP errors with urllib2

Question

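I have this code, but it doesn't work. I want to use urllib2 to iterate over a list of URLs. After opening each URL, BeautifulSoup finds a class and extracts its text. If the list contains an invalid URL, the program stops. If there is any error, I just want to set the text to 'error' and have the program continue to the next URL. Any ideas?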

    for url in url_list:
         page=urllib2.urlopen(url)
         soup = BeautifulSoup(page.read())

         text = soup.find_all(class_='ProfileHeaderCard-locationText u-dir')
         if text is not None:
            for t in text:
                text2 = t.get_text().encode('utf-8')
         else:
            text2 = 'error'

Answer 1

try/except is your friend! Change your code to something like:

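    import urllib2
    from bs4 import BeautifulSoup  # assuming bs4, since the question uses find_all and class_

    for url in url_list:  # url_list is the list from the question
        try:
            page = urllib2.urlopen(url)
        except urllib2.URLError:
            # Also catches urllib2.HTTPError, which subclasses URLError.
            text2 = 'error'
        else:
            # Runs only when urlopen succeeded.
            soup = BeautifulSoup(page.read())
            text = soup.find_all(class_='ProfileHeaderCard-locationText u-dir')
            if text:  # find_all returns an empty list, never None, when nothing matches
                for t in text:
                    text2 = t.get_text().encode('utf-8')
            else:
                text2 = 'error'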
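One gap worth noting: a list entry that is not a URL at all (no scheme, for instance) makes urlopen raise ValueError rather than URLError, so the loop above would still stop on it. Here is a minimal sketch of one way to cover both cases. The helper name `fetch_or_error` and its `opener` argument are hypothetical, added only so the function can be tried without a network connection, and the import fallback assumes that on Python 3 `urllib.request` exposes the same `urlopen` and `URLError` names.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    # Assumption: on Python 3, urllib.request provides urlopen and URLError too.
    import urllib.request as urllib2


def fetch_or_error(url, opener=None):
    """Return the response body, or the string 'error' if the URL cannot be opened."""
    if opener is None:
        opener = urllib2.urlopen
    try:
        return opener(url).read()
    except (urllib2.URLError, ValueError):
        # URLError covers unreachable hosts and HTTP error statuses
        # (HTTPError subclasses it); ValueError is raised for strings
        # that are not URLs at all, such as 'not-a-url'.
        return 'error'
```

Inside the loop you would call `fetch_or_error(url)` and only hand the result to BeautifulSoup when it is not `'error'`.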

Answer 2

urllib2.urlopen raises URLError when an error occurs, as you can find in the docs.

Use a try-except block:

    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError as e:
        print e
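If you want to tell an HTTP error status (such as 404) apart from a connection failure, you can catch HTTPError before URLError, since HTTPError is a subclass of it. A rough sketch follows. The function name `open_or_describe`, the `opener` argument, and the message formats are made up for illustration, and the import fallback assumes Python 3's `urllib.request` exposes the same names.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    # Assumption: on Python 3, urllib.request provides the same names.
    import urllib.request as urllib2


def open_or_describe(url, opener=None):
    """Return (page, None) on success, or (None, message) on failure."""
    if opener is None:
        opener = urllib2.urlopen
    try:
        return opener(url), None
    except urllib2.HTTPError as e:
        # The server answered, but with an error status; e.code holds it.
        return None, 'HTTP %d' % e.code
    except urllib2.URLError as e:
        # No usable response at all (DNS failure, refused connection, ...).
        return None, 'unreachable: %s' % (e.reason,)
```

The order of the except clauses matters: with URLError listed first, the HTTPError branch would never run.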


Tags: python, http, urllib2, beautifulsoup