Why bad HTTP status with ProtocolError('Connection aborted.', BadStatusLine("''",)) when getting data from Zendesk API?

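I am trying to retrieve the user identities for a few hundred thousand user ids from the Zendesk API, using Python 3.4.3 and the requests library. It works for many user ids, and then my program receives a bad response from the Zendesk API.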

Below is the relevant Python function:

def get_user_identities(user_id):
  url = config.zendesk_api_url + '/api/v2/users/' + user_id + '/identities.json'

  session = requests.Session()
  session.auth = config.credentials

  response = ''

  while True:
    try:
      response = session.get(url)
    except requests.ConnectionError as error:
      logger.error("ConnectionError: {0}".format(error))
      num_seconds = 30
      logger.info("Sleeping for {} seconds...".format(num_seconds))
      time.sleep(num_seconds)
    else:
      break

  while True:
    response = session.get(url)
    if response.status_code == 429:
      logger.info('Rate limited! Waiting for {} seconds'.format(response.headers['retry-after']))
      time.sleep(int(response.headers['retry-after']))
    else:
      break

  if response.status_code != 200:
    logger.error('Error with status code {}'.format(response.status_code))
    exit()

  data = response.json()

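This function is called in a loop and retrieves the user identity for thousands of users without any problem, but then it exits because of a bad HTTP response status: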

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/urllib3/connectionpool.py", line 595, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.4/dist-packages/urllib3/connectionpool.py", line 393, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.4/dist-packages/urllib3/connectionpool.py", line 389, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1171, in getresponse
    response.begin()
  File "/usr/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.4/http/client.py", line 321, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 330, in send
    timeout=timeout
  File "/usr/local/lib/python3.4/dist-packages/urllib3/connectionpool.py", line 640, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.4/dist-packages/urllib3/util/retry.py", line 287, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='companyname.zendesk.com', port=443): Max retries exceeded with url: /api/v2/users/1608220001/identities.json (Caused by ProtocolError('Connection aborted.', BadStatusLine("''",)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/emre.sevinc/code/company-zendesk/get_user_identities.py", line 72, in <module>
    get_user_identities(user_id)
  File "/home/emre.sevinc/code/company-zendesk/get_user_identities.py", line 42, in get_user_identities
    response = session.get(url)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 467, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 455, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 558, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 378, in send
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='companyname.zendesk.com', port=443): Max retries exceeded with url: /api/v2/users/1608220001/identities.json (Caused by ProtocolError('Connection aborted.', BadStatusLine("''",)))

But when I test the same URL with HTTPie to retrieve the user identities, it works just fine:

$ http -a user@company.com:password https://companyname.zendesk.com/api/v2/users/1608220001/identities.json

HTTP/1.1 200 OK
Cache-Control: must-revalidate, private, max-age=0
Connection: keep-alive
Content-Encoding: gzip
Content-Type: application/json; charset=UTF-8
Date: Tue, 12 Sep 2017 15:11:39 GMT
ETag: W/"8135d41f9068e1c2b45d0f307c6431d4"
Last-Modified: Mon, 09 Nov 2015 20:55:44 GMT
Server: nginx
Strict-Transport-Security: max-age=31536000;
Transfer-Encoding: chunked
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Rack-Cache: miss
X-Rate-Limit: 700
X-Rate-Limit-Remaining: 416
X-Request-Id: f1320883-caf0-4d33-cd94-a0369f4368f8
X-Runtime: 0.381444
X-UA-Compatible: IE=Edge,chrome=1
X-Zendesk-API-Version: v2
X-Zendesk-Application-Version: v40.20
X-Zendesk-Origin-Server: app15.pod3.dub1.zdsys.com
X-Zendesk-Request-Id: a0606a3ae1d043968f53

{
    "count": 1, 
    "identities": [
        {
            "created_at": "2015-11-09T20:55:44Z", 
            "deliverable_state": "deliverable", 
            "id": 1020870341, 
            "primary": true, 
            "type": "email", 
...

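Could it be that the Zendesk REST API endpoint 'thinks' I am trying to "scrape" it and deliberately drops the connection, as has been suggested elsewhere?

Or is it something else, and do you have any suggestions for making it work (other than faking the user agent)?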

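Apparently, the code had to catch one more exception, urllib3.exceptions.MaxRetryError, and handle one more HTTP status code (BAD_GATEWAY_ERROR = 502) in order to cope with what the Zendesk REST API endpoint throws at it: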

import time

import requests
import urllib3

BAD_GATEWAY_ERROR = 502
RATE_LIMITED_ERROR = 429
MAX_NUM_SECONDS_TO_SLEEP = 30
MAX_NUM_OF_ALLOWED_RETRIES = 10


def get_user_identities(user_id):
  url = config.zendesk_api_url + '/api/v2/users/' + user_id + '/identities.json'

  session = requests.Session()
  session.auth = config.credentials

  script_path = get_script_path()

  num_retries = 0
  response = ''

  while True:
    if num_retries > MAX_NUM_OF_ALLOWED_RETRIES:
      logger.error('Tried more than {} times without success. Skipping the user id {} .'
                   .format(MAX_NUM_OF_ALLOWED_RETRIES, user_id))
      return

    try:
      response = session.get(url)

      if response.status_code == RATE_LIMITED_ERROR:
        logger.info('Rate limited! Waiting for {} seconds and will try again.'
                    .format(response.headers['retry-after']))
        time.sleep(int(response.headers['retry-after']))
        num_retries += 1
        continue

      if response.status_code == BAD_GATEWAY_ERROR:
        logger.info('Bad Gateway Error. Waiting for {} seconds and will try again.'
                    .format(str(MAX_NUM_SECONDS_TO_SLEEP)))
        time.sleep(MAX_NUM_SECONDS_TO_SLEEP)
        num_retries += 1
        continue

      if response.status_code != 200:
        logger.error('Error with status code {}. Skipping the user id {}'
                     .format(response.status_code, user_id))
        return

    except (requests.ConnectionError, urllib3.exceptions.MaxRetryError) as error:
      logger.error("ConnectionError: {0}".format(error))
      logger.info("Sleeping for {} seconds...".format(MAX_NUM_SECONDS_TO_SLEEP))
      time.sleep(MAX_NUM_SECONDS_TO_SLEEP)
      num_retries += 1
    else:
      break

  data = response.json()

After making the above changes, it was able to successfully retrieve more than 700,000 records from the Zendesk REST API endpoint.
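
The retry logic above can also be pulled out into a small helper that does not depend on requests, which makes it easier to test. What follows is only a sketch under a few assumptions, not the code that was actually run: the names fetch_with_retries and retry_delay are made up for illustration, fetch is any callable that returns an object with status_code and headers (for example lambda: session.get(url)), and a missing or non-numeric Retry-After header falls back to the fixed 30-second sleep.

```python
import time


def retry_delay(headers, default_seconds=30):
    """Return the Retry-After value in seconds, or the default if it is
    missing or not a plain integer (hypothetical helper)."""
    value = headers.get('retry-after')
    try:
        return max(0, int(value))
    except (TypeError, ValueError):
        return default_seconds


def fetch_with_retries(fetch, max_retries=10, default_seconds=30,
                       retry_statuses=(429, 502),
                       retry_exceptions=(ConnectionError,),
                       sleep=time.sleep):
    """Call fetch() until it returns a response that should not be retried.

    Returns that response, or None once max_retries retries are used up.
    With requests, pass retry_exceptions=(requests.ConnectionError,),
    because it is not a subclass of the built-in ConnectionError.
    """
    for attempt in range(max_retries + 1):
        try:
            response = fetch()
        except retry_exceptions:
            # Connection dropped (e.g. BadStatusLine): wait the fixed delay.
            delay = default_seconds
        else:
            if response.status_code not in retry_statuses:
                # 200 or a non-retryable error; the caller decides what to do.
                return response
            # 429 or 502: honour Retry-After when the server sent one.
            delay = retry_delay(response.headers, default_seconds)
        if attempt < max_retries:
            sleep(delay)
    return None
```

With requests, the call would look something like fetch_with_retries(lambda: session.get(url), retry_exceptions=(requests.ConnectionError,)); a return value of None means the user id should be skipped, and any other response still needs its status code checked before calling response.json(). Passing sleep as a parameter is just a convenience so the function can be exercised without actually waiting.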

The problem I ran into was similar to how the Zendesk server behaves in this case.