Urllib read inside except does not work

Question

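I am trying to read the source code (HTML) of several websites using the code below. It works fine as long as the site is encoded in UTF-8, but sites encoded in ISO-8859-1 cause problems. As you can see in the code, execution should go to the second except block, and when I run the program the debug prints inside that block do appear. However, the variable html_doc does not get any value. The f.read().decode(...) statement itself does not seem to be the problem, because the same statement, now commented out, works perfectly when placed outside the try-except block. Why does this happen? I would appreciate any suggestions on how to fix it, as I have not been able to solve it myself so far.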

def getSource(self, target_url):
    print(target_url)
    html_doc = None
    try:
        f = urllib.request.urlopen(target_url)
    except:
        return None
    #html_doc = f.read().decode("ISO-8859-1")
    try:
        html_doc = f.read().decode("utf-8")     # Save source code of URL to html_doc
        print(html_doc)
    except:
        print("I Went here")
        html_doc = f.read().decode("ISO-8859-1")   # Use other encoding if failed
        print("I SAID SO")
    print(html_doc)
    return html_doc

Answer 1

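I would suggest reading the file into a variable first and decoding it afterwards, once the file has been closed. I believe what happens here is that you open the file and read the data, the decode fails, and then you try to read again, but there is no more data left to read, so html_doc ends up empty.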

So something like this:

html_doc = f.read()   # Read the raw bytes exactly once
try:
    html_doc = html_doc.decode("utf-8")
except UnicodeDecodeError:
    html_doc = html_doc.decode("ISO-8859-1")   # Fall back if the bytes are not valid UTF-8
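
To make the explanation above concrete, here is a minimal, self-contained sketch. It assumes an `io.BytesIO` object as a stand-in for the response returned by `urllib.request.urlopen`, so it runs without network access; the helper name `decode_with_fallback` and the sample text are hypothetical and not part of the original code. It shows that a second `read()` on an exhausted stream returns empty bytes, and that decoding the bytes already held in a variable avoids the problem.

```python
import io


def decode_with_fallback(raw, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in order and return the first successful decode.

    Hypothetical helper for illustration only.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the given encodings could decode the data")


# io.BytesIO stands in for the object returned by urlopen() (an assumption
# made so the example runs offline). The payload is ISO-8859-1, not UTF-8.
stream = io.BytesIO("café".encode("ISO-8859-1"))

raw = stream.read()          # consumes the whole stream
second_read = stream.read()  # nothing is left, so this is empty bytes

# Decode the bytes we already hold instead of reading the stream again.
text = decode_with_fallback(raw)
```

Because the raw bytes are kept in `raw`, each decoding attempt works on the same data, and the stream is never read a second time.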

python

python-3.x

urllib

try-except