python requests: adding "referer" header to redirected requests

Question

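I would like to know whether python requests supports the "autoreferer" feature found in curl. Basically, with allow_redirects=True, requests should automatically set the "Referer" header on each subsequent redirected request.

Here is what the request headers look like with requests (no "Referer" header):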

>>> import requests
>>> import logging
>>> import http.client
>>> http.client.HTTPConnection.debuglevel = 1
>>> logging.basicConfig()
>>> logging.getLogger().setLevel(logging.DEBUG)
>>> requests_log = logging.getLogger("requests.packages.urllib3")
>>> requests_log.setLevel(logging.DEBUG)
>>> requests_log.propagate = True
>>> r = requests.post('http://www.somewebsite.com', allow_redirects=True)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): www.somewebsite.com:80
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 307 Temporary Redirect\r\n'
DEBUG:urllib3.connectionpool:http://www.somewebsite.com:80 "POST / HTTP/1.1" 307 185
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.somewebsite.com:443
header: Server header: Date header: Content-Type header: Content-Length header: Connection header: Location header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
DEBUG:urllib3.connectionpool:https://www.somewebsite.com:443 "POST / HTTP/1.1" 302 13
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): somewebsite.com:443
header: Content-Type header: Content-Length header: Connection header: Date header: Location header: Access-Control-Allow-Origin header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'GET / HTTP/1.1\r\nHost: somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
DEBUG:urllib3.connectionpool:https://somewebsite.com:443 "GET / HTTP/1.1" 200 149681
header: Content-Type header: Content-Length header: Connection header: Date header: Server header: Expires header: Last-Modified header: Content-Encoding header: Via header: Vary header: Accept-Ranges header: Cache-Control header: Set-Cookie header: X-Cache header: X-Amz-Cf-Pop header: X-Amz-Cf-Id

Here is what the request headers look like with pycurl (with the "Referer" header):

>>> import pycurl
>>> from io import BytesIO
>>> buffer = BytesIO()
>>> c = pycurl.Curl()
>>> c.setopt(c.URL, 'http://www.somewebsite.com/')
>>> c.setopt(c.WRITEDATA, buffer)
>>> c.setopt(pycurl.VERBOSE, 1)
>>> c.setopt(pycurl.AUTOREFERER, 1)
>>> c.setopt(pycurl.FOLLOWLOCATION, 1)
>>> c.perform()
>>> c.close()
*   Trying 99.84.194.56...
* Connected to www.somewebsite.com (99.84.194.56) port 80 (#0)
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*

< HTTP/1.1 301 Moved Permanently
< Server: CloudFront
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Content-Type: text/html
< Content-Length: 183
< Connection: keep-alive
< Location: https://www.somewebsite.com/
< X-Cache: Redirect from cloudfront
< Via: 1.1 40ddfb9607f5d49c286c41e9afdce772.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Uij3cpBtl0ZJ_OwFFDSint5ab3Ayvn0okmhJekgtxI-etIN5l07sjg==
< 
* Ignoring the response-body
* Connection #0 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://www.somewebsite.com/'
* Found bundle for host www.somewebsite.com: 0x2ab53b0 [can pipeline]
*   Trying 99.84.194.113...
* Connected to www.somewebsite.com (99.84.194.113) port 443 (#1)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*    subject: CN=watchdisneyfe.com
*    start date: Dec 16 00:00:00 2019 GMT
*    expire date: Jan 16 12:00:00 2021 GMT
*    subjectAltName: www.somewebsite.com matched
*    issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*    SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: http://www.somewebsite.com/

< HTTP/1.1 302 Moved Temporarily
< Content-Type: text/plain
< Content-Length: 13
< Connection: keep-alive
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Location: https://somewebsite.com/
< Access-Control-Allow-Origin: *
< X-Cache: Miss from cloudfront
< Via: 1.1 74d35431a23bfc97a6055173d9be2dc4.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Bxg1W9zPN7U4i8GqysA11vj6h2dyDZdClyMUfUMfVUqd-v_mrQXGhQ==
< 
* Ignoring the response-body
* Connection #1 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://somewebsite.com/'
*   Trying 13.225.146.93...
* Connected to somewebsite.com (13.225.146.93) port 443 (#2)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*    subject: CN=watchdisneyfe.com
*    start date: Dec 16 00:00:00 2019 GMT
*    expire date: Jan 16 12:00:00 2021 GMT
*    subjectAltName: somewebsite.com matched
*    issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*    SSL certificate verify ok.
> GET / HTTP/1.1
Host: somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: https://www.somewebsite.com/

< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 1218349
< Connection: keep-alive
< Vary: Accept-Encoding
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Server: nginx/1.16.1
< Expires: Wed, 26 Feb 2020 21:56:48 GMT
< Last-Modified: Wed, 26 Feb 2020 21:56:48 GMT
< Via: 1.1 varnish-v4, 1.1 a52dcb1fed052adbd58b868375961d24.cloudfront.net (CloudFront)
< Vary: Accept-Encoding
< Accept-Ranges: bytes
< Cache-Control: max-age=0, must-revalidate
< Set-Cookie: SWID=72B09DFD-D038-485C-C836-7229EB59F0B1; path=/; Expires=Sun, 26 Feb 2040 21:46:55 GMT; domain=somewebsite.com;
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Pop: LAX3-C4
< X-Amz-Cf-Id: JGF1k-OnDIZT_1DP5psnrlb9jmmp7rq69QbGNZL1CVGbjJWjORwpGQ==
< 
* Connection #2 to host somewebsite.com left intact

Is it possible to have the "Referer" header added automatically, as curl does?

Note: if you want to try this yourself, replace "somewebsite" with, for example, "abc".

Answer 1

requests has no official hook for this task. You can, however, subclass requests.Session and wrap a method that is called for every redirect, Session.rebuild_auth():

When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.

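Because it is called with both the next (prepared) request and the previous response that triggered the redirect, it is well suited to adding a Referer header: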

import requests

class RefererSession(requests.Session):
    def rebuild_auth(self, prepared_request, response):
        super().rebuild_auth(prepared_request, response)
        prepared_request.headers["Referer"] = response.url

Then use this subclass for all your requests:

with RefererSession() as session:
    r = session.post('http://www.somewebsite.com', allow_redirects=True)

Demo using https://httpbin.org:

>>> import requests
>>> import http.client
>>> from ast import literal_eval
>>> def echo_request_lines(msg, *rest):
...     """HTTPConnection debug print handler, writes out request lines"""
...     if msg != 'send:': return
...     request_lines = literal_eval(rest[0]).replace(b'\r', b'')
...     print(request_lines.rstrip().decode('latin1'))
...     print()
...
>>> http.client.HTTPConnection.debuglevel = 1
>>> http.client.print = echo_request_lines
>>> class RefererSession(requests.Session):
...     def rebuild_auth(self, prepared_request, response):
...         super().rebuild_auth(prepared_request, response)
...         prepared_request.headers["Referer"] = response.url
...
>>> with RefererSession() as session:
...     r = session.get('https://httpbin.org/redirect/2')
...
GET /redirect/2 HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

GET /relative-redirect/1 HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Referer: https://httpbin.org/redirect/2

GET /get HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Referer: https://httpbin.org/relative-redirect/1

>>> from pprint import pprint
>>> pprint(dict(r.history[1].request.headers))
{'Accept': '*/*',
 'Accept-Encoding': 'gzip, deflate',
 'Connection': 'keep-alive',
 'Referer': 'https://httpbin.org/redirect/2',
 'User-Agent': 'python-requests/2.22.0'}
>>> pprint(dict(r.request.headers))
{'Accept': '*/*',
 'Accept-Encoding': 'gzip, deflate',
 'Connection': 'keep-alive',
 'Referer': 'https://httpbin.org/relative-redirect/1',
 'User-Agent': 'python-requests/2.22.0'}
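
The subclass above copies response.url into the header verbatim. If that URL could carry credentials or a fragment, or if a redirect could lead from https back to plain http, you may want to filter the value first. The following is only a sketch of one way to do that: referer_for_redirect is a hypothetical helper name, not part of requests, and the rules it applies (drop userinfo and fragment, send nothing on an https to http downgrade) are an assumption about what you want, not a description of what curl does.

```python
from urllib.parse import urlsplit, urlunsplit


def referer_for_redirect(previous_url, next_url):
    """Return a Referer value for a redirect hop, or None to send none.

    Hypothetical helper for illustration; not part of requests.
    """
    prev = urlsplit(previous_url)
    nxt = urlsplit(next_url)

    # Do not reveal a URL fetched over https to a non-https destination.
    if prev.scheme == "https" and nxt.scheme != "https":
        return None

    # Keep only the host[:port] part, dropping any "user:password@" prefix.
    netloc = prev.netloc.rpartition("@")[2]

    # Passing an empty last component drops the fragment.
    return urlunsplit((prev.scheme, netloc, prev.path, prev.query, ""))
```

Inside rebuild_auth you would call it as referer_for_redirect(response.url, prepared_request.url) and assign the result to prepared_request.headers["Referer"]. When it returns None, it is probably safest to call prepared_request.headers.pop("Referer", None) as well, since, as far as I can tell, each redirected request starts as a copy of the previous one and could otherwise carry over a stale value.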
