Multiple URL requests to API without getting error from urllib2 or requests

I am trying to get data from different APIs. The responses arrive as JSON, are stored in SQLite, and are then parsed.
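A minimal sketch of that store-then-parse flow, to make the setup concrete. The table name `Responses`, its columns, and the helper names `save_response` and `load_parsed` are assumptions made for illustration; they do not come from the actual script.

```python
import json
import sqlite3


def save_response(conn, url, raw_json):
    """Store the raw JSON text for one request; parsing happens later."""
    # Hypothetical schema: one row per request, body kept as plain text.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS Responses (url TEXT, body TEXT)')
    conn.execute(
        'INSERT INTO Responses (url, body) VALUES (?, ?)', (url, raw_json))
    conn.commit()


def load_parsed(conn):
    """Return (url, parsed_object) pairs for everything stored so far."""
    rows = conn.execute('SELECT url, body FROM Responses').fetchall()
    return [(url, json.loads(body)) for url, body in rows]


conn = sqlite3.connect(':memory:')  # a file path would be used in practice
save_response(conn, 'https://www.whateversite.com/api/index.php?a&b',
              '{"id": 1}')
parsed = load_parsed(conn)
```

Keeping the raw text in the database means a failed or changed parsing step never requires requesting the same URL again.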

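The problem I have is that, when sending many requests, I eventually get an error, even though I use time.sleep between requests.

Usual approach

My code looks like the code below. It runs inside a loop, and the URL to open changes on each iteration: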

import time
import urllib2

base_url = 'https://www.whateversite.com/api/index.php?'
custom_url = 'variable_text1' + '&' + 'variable_text2'  # changes on every iteration

url = base_url + custom_url

time.sleep(1)
data = urllib2.urlopen(url).read()

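This runs thousands of times in a loop. The problem appears after the script has been running for a while (up to a couple of hours), and then I get the following errors or similar ones: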

    data = urllib2.urlopen(url).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1222, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

    uh = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 437, in open_https
    h.endheaders(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1172, in connect
    self.timeout, self.source_address)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known

I think this happens because the modules throw an error at some point if they are used too often within a short period of time.
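One thing worth checking: the second traceback ends inside getaddrinfo, and "nodename nor servname provided, or not known" is the message getaddrinfo gives when a hostname cannot be resolved, so the failure may lie in the DNS lookup rather than in the HTTP modules themselves. The sketch below waits until the host resolves again before continuing. The helper name `wait_for_dns`, its defaults, and the injectable `resolve` parameter are assumptions made for illustration.

```python
import socket
import time


def wait_for_dns(host, port=443, attempts=5, delay=5.0,
                 resolve=socket.getaddrinfo):
    """Return True once `host` resolves, False if every attempt fails."""
    for attempt in range(attempts):
        try:
            # Same lookup that create_connection performs in the traceback.
            resolve(host, port, 0, socket.SOCK_STREAM)
            return True
        except socket.gaierror:
            # Name resolution failed; pause before asking the resolver again.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Calling `wait_for_dns('www.whateversite.com')` before a request would show whether the errors coincide with failed lookups; if they do, no choice of HTTP module would avoid them.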

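From what I have read in many different threads about which module is better, I think any of them would do for my needs; the main criterion for choosing one is that it can open as many URLs as possible. In my experience, urllib and urllib2 are better than requests, because requests crashes sooner.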

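Assuming I do not want to increase the waiting time used in time.sleep, these are the solutions I have thought of so far:

Possible solutions?

A

I thought of combining all the different modules. That would be:

B

Use a try .. except block to handle that exception, as suggested here

C

I have also read about [...]. I don't know exactly how it works, or whether it would actually help.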


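However, I am not convinced by any of these solutions.

Can you think of a more elegant and/or efficient solution to deal with this error?

I am using Python 2.7.

Even though I was not convinced, I ended up implementing the try .. except block, and I am very happy with the results: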

import time
import urllib2

# list_of_urls, conn and cur (the sqlite3 connection and cursor) are set up earlier in the script
for url in list_of_urls:
    time.sleep(2)
    try:
        response = urllib2.urlopen(url)
        data = response.read()
        time.sleep(0.1)
        response.close()  # as suggested by zachyee in the comments

        # code to save data in SQLite database

    except urllib2.URLError as e:
        # print the actual error instead of a hard-coded message
        print '***** urllib2.URLError: %s *****' % e
        # save the error in SQLite so the URL can be retried later
        timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
        cur.execute('''INSERT INTO Errors (error_type, error_ts, url_queried)
        VALUES (?, ?, ?)''', ('urllib2.URLError', timestamp, url))
        conn.commit()
        time.sleep(30)  # give it a small break

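The script ran to the end.

Out of thousands of requests I got 8 errors, which were saved in my database together with the corresponding URLs. That way I can try to retrieve those URLs again if needed.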
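A minimal sketch of that second pass over the saved errors, assuming the Errors table from the code above. The helper names `fetch` and `retry_failed` are hypothetical, and deleting a row once its URL succeeds is a design choice of this sketch, not something the original script does. The import fallback is there only so the sketch also runs on Python 3.

```python
import sqlite3

try:
    import urllib2  # Python 2.7, as used in this post
except ImportError:
    import urllib.request as urllib2  # lets the sketch also run on Python 3


def fetch(url):
    """Download one URL and always close the response."""
    response = urllib2.urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


def retry_failed(conn, fetch=fetch):
    """Re-request every URL logged in Errors; drop rows that now succeed."""
    cur = conn.cursor()
    cur.execute('SELECT DISTINCT url_queried FROM Errors')
    urls = [row[0] for row in cur.fetchall()]
    recovered = {}
    for url in urls:
        try:
            recovered[url] = fetch(url)
        except IOError:
            # URLError is a subclass of IOError; keep the row for a later pass.
            continue
        cur.execute('DELETE FROM Errors WHERE url_queried = ?', (url,))
        conn.commit()
    return recovered
```

`retry_failed(conn)` returns a dict mapping each recovered URL to its response body, which can then go through the same save-to-SQLite code as the main loop. URLs that fail again stay in the table.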