Why do I get a ConnectionError when a GET requests gz data using Python requests?

Question

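I am trying to request bulk log-level data from the AppNexus API. According to the official data service guide, there are four main steps:

1. Authenticate the account -> returns a token in JSON

2. Get the list of available data feeds and find the download parameters -> returns the parameters in JSON

3. Send a GET request with the download parameters to obtain the location code of the file to download -> extract the location code from the response header

4. Use the location code to download the log data file -> returns a gz data file

These steps work perfectly with curl in the terminal: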

curl -b cookies -c cookies -X POST -d @auth 'https://api.appnexus.com/auth'
curl -b cookies -c cookies 'https://api.appnexus.com/siphon?siphon_name=standard_feed'
curl --verbose -b cookies -c cookies 'https://api.appnexus.com/siphon-download?siphon_name=standard_feed&hour=2017_12_28_09&timestamp=20171228111358&member_id=311&split_part=0'
curl -b cookies -c cookies 'http://data-api-gslb.adnxs.net/siphon-download/[location code]' > ./data_download/log_level_feed.gz

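In Python, I am trying to do the same thing to test the API. However, it keeps giving me a "ConnectionError". Steps 1-2 still work fine, so I successfully get the parameters from the JSON response and build the URL for Step 3, where I need to request the location code and extract it from the response header.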

Step 1:

# Step 1
############ Authentication ###########################    
# Select End-Point
auth_endpoint = 'https://api.appnexus.com/auth'

# API Key
auth_app = json.dumps({'auth':{'username':'xxxxxxx','password':'xxxxxxx'}})

# Proxy
proxy = {'https':'https://proxy.xxxxxx.net:xxxxx'}
r = requests.post(auth_endpoint, proxies=proxy, data=auth_app)
data = json.loads(r.text)
token = data['response']['token']

Step 2:

# Step 2
########### Check report list ###################################
check_list_endpoint = 'https://api.appnexus.com/siphon?siphon_name=standard_feed'
report_list = requests.get(check_list_endpoint, proxies=proxy, headers={"Authorization":token})
data = json.loads(report_list.text)
print(str(len(data['response']['siphons'])) + ' previous hours available for download')

# Build url for single report - extract para
download_endpoint = 'https://api.appnexus.com/siphon-download'
siphon_name = 'siphon_name=standard_feed' 
hour = 'hour=' + data['response']['siphons'][400]['hour']
timestamp = 'timestamp=' + data['response']['siphons'][400]['timestamp'] 
member_id = 'member_id=311' 
split_part = 'split_part=' + data['response']['siphons'][400]['splits'][0]['part']

# Build url
download_endpoint_url = download_endpoint + '?' + \
siphon_name + '&' + \
hour + '&' + \
timestamp + '&' + \
member_id + '&' + \
split_part
# Check
print(download_endpoint_url)

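However, when I run Step 3, "requests.get" keeps raising a "ConnectionError". I also noticed that the "location code" actually appears in the error message, right after "/siphon-download/". So I use "try..except" to extract it from the error message and keep the code running.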

Step 3:

# Step 3
######### Extract location code for target report ####################
try:
    TT = requests.get(download_endpoint_url, proxies=proxy, headers={"Authorization":token}, timeout=1)
except ConnectionError, e:
    text = e.args[0].args[0]
    m = re.search('/siphon-download/(.+?) ', text)
    if m:
        location = m.group(1)
print('Successfully Extracting location: ' + location)

The original error message from Step 3 without the "try..except":

ConnectionError: HTTPConnectionPool(host='data-api-gslb.adnxs.net', port=80): Max retries exceeded with url: 
/siphon-download/dbvjhadfaslkdfa346583 
(Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000000007CBC7B8>: 
Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not 
properly respond after a period of time, or established connection failed because connected host has failed to respond',))

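Then I tried to make the final GET request with the location code extracted from the previous error message, to download the gz data file just as I did with "curl" in the terminal. However, I got the same error, a ConnectionError.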

Step 4:

# Step 4
######## Download data file #######################
extraction_location = 'http://data-api-gslb.adnxs.net/siphon-download/' + location
LLD = requests.get(extraction_location, proxies=proxy, headers={"Authorization":token}, timeout=1)

The original error message from Step 4:

ConnectionError: HTTPConnectionPool(host='data-api-gslb.adnxs.net', port=80): Max retries exceeded with url: 
/siphon-download/dbvjhadfaslkdfa346583 
(Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000000007BE15C0>: 
Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not 
properly respond after a period of time, or established connection failed because connected host has failed to respond',))

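To double-check, I tested all the endpoints, parameters, and location codes generated by my Python script with curl in the terminal. They all work fine and the downloaded data is correct. Can anyone help me solve this issue in Python, or point me in the right direction to find out why this is happening? Many thanks!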

Answer 1

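1) With curl you are reading and writing cookies (-b cookies -c cookies). With requests you are not using session objects (http://docs.python-requests.org/en/master/user/advanced/#session-objects), so your cookie data is lost between calls.

2) You define an https proxy, and then you try to connect over http (to data-api-gslb.adnxs.net) without a proxy. Define both http and https, but only once, on the session object. See http://docs.python-requests.org/en/master/user/advanced/#proxies. (This is probably the root cause of the error message you are seeing.)

3) requests handles redirects automatically, so there is no need to extract the location header and use it in the next request; the request is redirected for you. So once the other errors are fixed, there are 3 steps instead of 4. (This also answers Hetzroni's question in the comments above.)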
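Points 1 and 2 can be checked without touching the network. The sketch below is only an illustration: it uses `select_proxy` from `requests.utils` (a helper that looks up the proxy entry for a URL) and `Session.prepare_request`. The proxy address, the `LOCATION_CODE` placeholder, and the cookie name and value are made-up assumptions, not values from the AppNexus API.

```python
import requests
from requests.utils import select_proxy

# Placeholder URL on the data host from the error message (location code is hypothetical).
data_url = "http://data-api-gslb.adnxs.net/siphon-download/LOCATION_CODE"

# Only an 'https' entry, as in the question: nothing matches an http:// URL,
# so the request to the data host would not go through this proxy.
https_only = {"https": "https://proxy.example.net:8080"}
print(select_proxy(data_url, https_only))

# With both schemes defined, the http:// URL gets a proxy as well.
both = {
    "http": "http://proxy.example.net:8080",
    "https": "https://proxy.example.net:8080",
}
print(select_proxy(data_url, both))

# Cookies stored on a Session are attached to later requests automatically,
# which is what curl's -b/-c options do with the cookie file.
s = requests.Session()
s.proxies = both
s.cookies.set("session_id", "abc123", domain="api.appnexus.com")  # hypothetical cookie
prepared = s.prepare_request(
    requests.Request("GET", "https://api.appnexus.com/siphon")
)
print(prepared.headers.get("Cookie"))
```

Note that requests can also pick up proxies from environment variables such as `HTTP_PROXY`, so the behaviour on a given machine may differ from this isolated lookup.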

So use

import requests

s = requests.Session()
# Set the proxies only once, on the session, using valid proxy URLs for both schemes.
s.proxies = {
    'http': 'http://proxy.xxxxxx.net:xxxxx',
    'https': 'https://proxy.xxxxxx.net:xxxxx',
}

and then use

s.get()

and

s.post()

instead of

requests.get()

and

requests.post()
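Putting the three points together, the flow might look roughly like the sketch below. It is a sketch under assumptions, not a tested client: it assumes the API accepts the session cookie the way the curl commands do, the helper names `make_session` and `download_feed` are hypothetical, the timeouts and chunk size are arbitrary choices, and the step that lists available feeds is left out (its results are passed in as `params`).

```python
import json
import requests

API = "https://api.appnexus.com"


def make_session(http_proxy, https_proxy):
    """Build one Session so cookies and proxies are shared by every request."""
    s = requests.Session()
    s.proxies = {"http": http_proxy, "https": https_proxy}
    return s


def download_feed(session, credentials, params, out_path):
    """Authenticate, then download one feed file to out_path."""
    # Authenticate; any cookies set by the response stay on the session.
    auth = session.post(
        API + "/auth",
        data=json.dumps({"auth": credentials}),
        timeout=30,
    )
    auth.raise_for_status()

    # Request the file. requests follows the redirect to the data host,
    # so the location code never has to be extracted by hand.
    with session.get(
        API + "/siphon-download", params=params, stream=True, timeout=60
    ) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as fh:
            # Stream the gz file to disk in chunks instead of loading it into memory.
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                fh.write(chunk)
    return out_path
```

A call would then pass the values from the question, for example `download_feed(s, {'username': 'xxxxxxx', 'password': 'xxxxxxx'}, {'siphon_name': 'standard_feed', 'hour': '2017_12_28_09', 'timestamp': '20171228111358', 'member_id': '311', 'split_part': '0'}, 'log_level_feed.gz')`. Passing the query as a `params` dict lets requests build and encode the URL, replacing the manual string concatenation.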

python

api

curl

urllib2

python-requests