Re-attempt to open URL with urllib in Python on timeout

Question

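I am using Python to parse data from a large number of web pages (>10k), and the function I wrote for this hits a timeout error roughly once every 500 iterations. I tried to handle this with a try/except block, but I would like to improve the function so that it re-attempts to open the URL four or five times before returning the error. Is there an elegant way to do this?

My code is below: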

from urllib.error import HTTPError
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def url_open(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        s = urlopen(req, timeout=50).read()
    except HTTPError as e:
        if e.code == 404:
            print(str(e))
        else:
            print(str(e))
            s = urlopen(req, timeout=50).read()
            raise
    return BeautifulSoup(s, "lxml")

Answer 1

I have used a pattern like this for retrying in the past:

import time
from urllib.error import HTTPError
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def url_open(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    retrycount = 0
    s = None
    while s is None:
        try:
            s = urlopen(req, timeout=50).read()
        except HTTPError as e:
            print(str(e))
            if canRetry(e.code):
                retrycount += 1
                if retrycount > 5:
                    raise  # give up after five retries
                time.sleep(1)  # brief pause before the next attempt
            else:
                raise

    return BeautifulSoup(s, "lxml")

You just have to define canRetry somewhere else.
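
One possible shape for canRetry is sketched below; treat it as a minimal illustration rather than a definitive implementation. The set of status codes treated as retryable is an assumption, and RETRYABLE_STATUS, is_transient and call_with_retries are hypothetical names that do not appear in the code above. The sketch also covers timeouts, which urlopen does not report as HTTPError: a connect timeout usually arrives as a URLError whose reason is a socket.timeout, while a timeout during read() is raised as socket.timeout directly.

```python
import socket
import time
from urllib.error import HTTPError, URLError

# Assumption: these HTTP status codes are treated as temporary failures.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def canRetry(code):
    """Return True if an HTTP status code looks worth retrying."""
    return code in RETRYABLE_STATUS


def is_transient(exc):
    """Hypothetical helper: decide whether an exception is worth a retry."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, HTTPError):
        return canRetry(exc.code)
    if isinstance(exc, URLError):
        # A connect timeout is wrapped in URLError.reason.
        return isinstance(exc.reason, socket.timeout)
    # A timeout raised while reading the response body.
    return isinstance(exc, socket.timeout)


def call_with_retries(fetch, max_retries=5, delay=1.0):
    """Hypothetical helper: call fetch(), retrying transient failures."""
    attempt = 0
    while True:
        try:
            return fetch()
        except (URLError, socket.timeout) as exc:
            attempt += 1
            if attempt > max_retries or not is_transient(exc):
                raise
            time.sleep(delay)  # fixed pause between attempts
```

Inside url_open this could be used as s = call_with_retries(lambda: urlopen(req, timeout=50).read()), which would replace the while loop. A fixed delay keeps the sketch short; a delay that grows with each attempt is a common alternative when the server is overloaded.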

Tags: python, urllib, python-3.x, bs4