TimeoutError: Large amount of data in requests python

TimeoutError: Large amount of data in requests python

我正在尝试制作一个脚本,从 slideshare link 中抓取演示文稿并将其下载为 PDF。

脚本运行良好,直到幻灯片总数低于 20。在 python 中是否有替代 requests 的方法可以完成这项工作。

这是脚本:

import requests
from bs4 import BeautifulSoup
from PIL import Image
import io

URL_LESS = "https://www.slideshare.net/angelucmex/global-warming-2373190?qid=8f04572c-48df-4f53-b2b0-0eb71021931c&v=&b=&from_search=1"
URL="https://www.slideshare.net/tusharpanda88/python-basics-59573634?qid=03cb80ee-36f0-4241-a516-454ad64808a8&v=&b=&from_search=5"
r = requests.get(URL_LESS)

soup = BeautifulSoup(r.content, "html5lib")

imgs = soup.find_all('img', class_="slide-image")
imgSRC = [x.get("srcset").split(',')[0].strip().split(' ')[0].split('?')[0] for x in imgs]

imagesJPG = []
for img in imgSRC:
    im = requests.get(img)
    f = io.BytesIO(im.content)
    imgJPG = Image.open(f)
    imagesJPG.append(imgJPG)

imagesJPG[0].save(f"{soup.title.string}.pdf",save_all=True, append_images=imagesJPG[1:])

试试把URL_LESS改成URL,你就会明白了。

这是回溯

Traceback (most recent call last):
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\connection.py", line 95, in create_connection
    raise err
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 358, in connect
    conn = self._new_conn()
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\adapters.py", line 440, in send
    resp = conn.urlopen(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='image.slidesharecdn.com', port=443): Max retries exceeded with url: /pythonbasics-160315100530/85/python-basics-8-320.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\Work\py\scrapingScripts\slideshare\main.py", line 16, in <module>
    im = requests.get(img)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='image.slidesharecdn.com', port=443): Max retries exceeded with url: /pythonbasics-160315100530/85/python-basics-8-320.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did 
not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

在使用 URL 和 URL_LESS 时,该脚本对我来说工作得很好,所以你的互联网可能是这里的罪魁祸首。

我的猜测是:

  1. 您正在使用 slow/inconsistent 互联网。

  2. Slideshare 正在将您的 IP 列入黑名单/web-agent可能是为了保护 DDOS。(不太可能)

  3. 您使用的是 ipv6,这对我来说是这类情况的罪魁祸首,请尝试将您的网络切换为仅使用 ipv4。

当涉及到请求时,我个人在相当长的一段时间内使用它来抓取相当大量的数据,所以我可以说这是一个很棒的库