Web Crawler - TooManyRedirects: Exceeded 30 redirects. (python)
I have tried to follow one of the YouTube tutorials, but I ran into some problems. Can anyone help? I am new to Python; I know there are one or two similar questions, but I read them and did not understand them. Could someone help me? Thanks.
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/home.php?page=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'post-title'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(2)
On running the program I get the following error:
Traceback (most recent call last):
  File "C:/Users/User/PycharmProjects/Basic/WebCrawlerTest.py", line 19, in <module>
    trade_spider(2)
  File "C:/Users/User/PycharmProjects/Basic/WebCrawlerTest.py", line 9, in trade_spider
    source_code = requests.get(url)
  File "C:\Users\User\AppData\Roaming\Python\Python34\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\User\AppData\Roaming\Python\Python34\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Users\User\AppData\Roaming\Python\Python34\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\User\AppData\Roaming\Python\Python34\site-packages\requests\sessions.py", line 594, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Users\User\AppData\Roaming\Python\Python34\site-packages\requests\sessions.py", line 594, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Users\User\AppData\Roaming\Python\Python34\site-packages\requests\sessions.py", line 114, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
Well, the page you are trying to crawl appears to be completely broken: try putting https://www.thenewboston.com/forum/home.php?page=1 into your web browser. When I tried it with Chrome, I got the error message:
This webpage has a redirect loop
ERR_TOO_MANY_REDIRECTS
You will have to decide for yourself how to handle broken pages like this in your crawler.
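One common approach is to catch the exception and skip the broken URL so the rest of the crawl can continue. A minimal sketch, assuming you want to skip rather than retry (fetch_page is a hypothetical helper, not part of the original code):

import requests
from requests.exceptions import TooManyRedirects

def fetch_page(url):
    # Return the page text, or None if the page is stuck in a redirect loop.
    try:
        # timeout is an added safeguard so a dead server cannot hang the crawler
        return requests.get(url, timeout=10).text
    except TooManyRedirects:
        print("Skipping broken page (redirect loop):", url)
        return None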
The forum's URL has changed. Two modifications to your code:

1. Changed the forum URL to "https://www.thenewboston.com/forum/recent_activity.php?page=" + str(page).
2. Added allow_redirects=False (to disable redirects, if any).
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/recent_activity.php?page=" + str(page)
        print(url)
        # allow_redirects=False stops requests from following the broken redirect chain
        source_code = requests.get(url, allow_redirects=False)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'post-title'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(2)
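A follow-up note: with allow_redirects=False, requests no longer raises TooManyRedirects for a looping page; it simply returns the 3xx response itself, whose body is typically empty, so the parsing loop finds nothing. A minimal sketch of detecting that case (this check is an addition, not part of the original answer):

source_code = requests.get(url, allow_redirects=False)
if source_code.is_redirect:
    # The server answered with a redirect; log where it points instead of parsing an empty body.
    print("Page redirects to:", source_code.headers.get("Location"))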