Scraping a Heritrix page using Python's requests module

Question

I want to scrape the Heritrix home page using Python's requests module. When I try to open this page in Chrome, I get the error:

This server could not prove that it is 10.100.121.41; its security  
certificate is not trusted by your computer's operating system. This   
may be caused by a misconfiguration or an attacker intercepting your    
connection.

However, I can proceed past the warning and open the page. When I try to scrape the same page using requests, I get an SSL error. After some digging, I used the following code from an SO question:

r = requests.get(url, auth=(username, password), verify=False)

That gives me the warning below:

/usr/lib/python2.6/site-packages/requests/packages/urllib3/connectionpool.py:734: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html

and the status code returned is 401. How do I solve this?

Answer 1

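A 401 means you need to authenticate, but you are using the wrong method. Passing a (username, password) tuple makes requests send HTTP Basic credentials. Another very common authentication method built into requests is digest authentication. You can determine whether the server is asking for digest authentication by looking at: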

r.headers.get('www-authenticate')

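The value should contain Digest. (If it does not, the server is not asking for digest authentication.) You can use digest authentication with requests like this: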

import requests
from requests import auth

r = requests.get(url, auth=auth.HTTPDigestAuth(username, password), verify=False)
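The header check described above can be folded into a small helper. This is only a rough sketch: first_auth_scheme is a hypothetical name, it looks only at the first challenge in the header, and the sample header values are made up for illustration rather than taken from a real Heritrix response.

```python
def first_auth_scheme(www_authenticate):
    """Return the lower-cased scheme of the first challenge, or None.

    `www_authenticate` is the value of r.headers.get('www-authenticate'),
    which is None when the server sent no such header.
    """
    if not www_authenticate:
        return None
    parts = www_authenticate.split(None, 1)  # scheme, then its parameters
    return parts[0].lower() if parts else None


# Made-up header values, for illustration only.
print(first_auth_scheme('Digest realm="example", nonce="abc", qop="auth"'))
print(first_auth_scheme('Basic realm="example"'))
print(first_auth_scheme(None))
```

If the helper returns 'digest', auth.HTTPDigestAuth(username, password) is the matching choice; if it returns 'basic', the plain (username, password) tuple is what requests uses for basic authentication.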

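The warning you see is unrelated to the 401. It only tells you that you made an HTTPS request without verifying the server's certificate, so the connection could be subject to a man-in-the-middle attack. If you want to silence it, you can do the following: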

from requests.packages import urllib3
urllib3.disable_warnings()
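Note that disable_warnings() applies to the whole process. If you would rather silence the warning only around one call, a possible alternative is the standard library's warnings module. The sketch below is an assumption-laden illustration, not part of requests: quiet_unverified_https is a hypothetical helper, and it matches on the warning text quoted in the question rather than on the warning class.

```python
import contextlib
import warnings


@contextlib.contextmanager
def quiet_unverified_https():
    """Temporarily ignore warnings whose text starts with
    'Unverified HTTPS request' (the message quoted in the question)."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message="Unverified HTTPS request")
        yield
    # On exit, catch_warnings() restores the previous filters.
```

You would then wrap only the request itself, for example by placing the requests.get(...) call inside a "with quiet_unverified_https():" block, so other warnings elsewhere in the program stay visible.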

ssl

python-requests

heritrix