urllib.error.HTTPError: HTTP Error 302

Question

I am trying to parse a website with an HTML parser using Python 3.6, but it throws the following error:

urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found

The code I wrote is as below:

from urllib.request import urlopen as uo
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter--')
html = uo(url, context=ctx).read()

soup = BeautifulSoup(html, "html.parser")

print(soup)
# Retrieve all the anchor tags
# tags = soup('a')

Can anyone tell me why this error is thrown, what it means, and how to resolve it?

Answer 1

As stated in the comments:

That site sets a cookie and then redirects to /Home.aspx.

To avoid the redirect loop on this site, you must set a 24-character ASP.NET_SessionId cookie.

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders.append(('Cookie', 'ASP.NET_SessionId=garbagegarbagegarbagelol'))
f = opener.open("http://apnakhata.raj.nic.in/")
html = f.read()
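Instead of hard-coding a cookie value, urllib can also be asked to remember whatever cookie the server sets and send it back on the redirected request. The following is a minimal sketch, assuming the loop is caused only by the missing session cookie as described above; make_cookie_opener and fetch_with_cookies are illustrative names, not part of urllib, and this has not been checked against the site itself.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_cookie_opener():
    """Build an opener that stores cookies and replays them on later requests."""
    jar = CookieJar()
    # HTTPCookieProcessor reads Set-Cookie from each response (including the 302)
    # and adds a Cookie header to each follow-up request.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_with_cookies(url):
    """Fetch url, letting the jar carry the session cookie across the redirect."""
    opener, jar = make_cookie_opener()
    with opener.open(url) as response:
        html = response.read()
    # Return the page body and the names of the cookies the server set.
    return html, [cookie.name for cookie in jar]
```

Calling fetch_with_cookies("http://apnakhata.raj.nic.in/") would then be expected to return the page body together with the cookie names collected along the way.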

However, I would just use requests.

import requests

r = requests.get('http://apnakhata.raj.nic.in/')
html = r.text

It saves cookies to a RequestsCookieJar by default. After the initial request, only one redirect occurs. You can see it here:

>>> r.history
[<Response [302]>]

>>> r.history[0].cookies
<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value='ph0chopmjlpi1dg0f3xtbacu', port=None, port_specified=False, domain='apnakhata.raj.nic.in', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>

To scrape the page, you can use requests_html, created by the same author.

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://apnakhata.raj.nic.in/')

Getting the links is very easy:

>>> r.html.absolute_links
{'http://apnakhata.raj.nic.in/',
'http://apnakhata.raj.nic.in/Cyberlist.aspx',
...
'http://apnakhata.raj.nic.in/rev_phone.aspx'}
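If installing requests_html is not an option, a similar set of absolute links can be approximated with the standard library alone. This is only a sketch: LinkCollector and absolute_links are made-up names, and it looks at href attributes of anchor tags only, which may not match everything requests_html reports.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns relative paths into absolute URLs.
                self.links.add(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return the set of absolute link targets found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Here html is the decoded page text (for example r.text from the requests snippet above) and base_url is the URL the page was fetched from.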

Answer 2

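The HyperText Transfer Protocol (HTTP) 302 Found redirect status response code indicates that the requested resource has been temporarily moved to the URL given by the Location header. A browser redirects to this page, but search engines do not update their links to the resource (in 'SEO-speak', it is said that the 'link-juice' is not sent to the new URL).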
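To see what a server actually sends with a 302, one option is to stop urllib from following the redirect and read the Location and Set-Cookie headers of the first response. A minimal sketch follows; NoRedirect and inspect_redirect are names invented for this example.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built, so the
        # 30x response surfaces as an HTTPError instead of being followed.
        return None


def inspect_redirect(url):
    """Return (status, Location, Set-Cookie) for the first response only."""
    opener = urllib.request.build_opener(NoRedirect())
    try:
        with opener.open(url) as response:
            return response.status, None, response.headers.get("Set-Cookie")
    except urllib.error.HTTPError as err:
        # For a redirect, the target and any cookie live in the error's headers.
        return err.code, err.headers.get("Location"), err.headers.get("Set-Cookie")
```

For a site that behaves like the one in the question, the returned tuple would be expected to hold 302, the redirect target, and the session cookie the server wants to see on the next request.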

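Even though the specification requires the method (and the body) not to be altered when the redirection is performed, not all user-agents conform here; you can still find this kind of buggy software out there. It is therefore recommended to set the 302 code only as a response to GET or HEAD methods and to use 307 Temporary Redirect instead, as the method change is explicitly prohibited in that case.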

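If you want the method used to be changed to GET, use 303 See Other instead. This is useful when you want to respond to a PUT method not with the uploaded resource but with a confirmation message such as: 'you successfully uploaded XYZ'.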
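On the server side, the advice above can be condensed into a small decision helper. This is a sketch of that advice only; choose_temporary_redirect is a hypothetical function, not part of any library or framework.

```python
from http import HTTPStatus


def choose_temporary_redirect(method, switch_to_get=False):
    """Pick a temporary redirect status following the advice above."""
    if switch_to_get:
        # 303 See Other: the client should fetch the new URL with GET.
        return HTTPStatus.SEE_OTHER
    if method.upper() in ("GET", "HEAD"):
        # 302 Found is unambiguous for GET and HEAD requests.
        return HTTPStatus.FOUND
    # 307 Temporary Redirect: the method and body must be kept as-is.
    return HTTPStatus.TEMPORARY_REDIRECT
```

For example, a handler answering a PUT with a confirmation page would call choose_temporary_redirect("PUT", switch_to_get=True) and send the resulting 303.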

urllib

beautifulsoup

html-parser

python-3.x