TimeoutError: Large amount of data in requests python

Question

I am trying to write a script that scrapes a presentation from a SlideShare link and downloads it as a PDF.

The script works fine as long as the presentation has fewer than 20 slides. Is there an alternative to requests in Python that can do this job?

Here is the script:

import requests
from bs4 import BeautifulSoup
from PIL import Image
import io

URL_LESS = "https://www.slideshare.net/angelucmex/global-warming-2373190?qid=8f04572c-48df-4f53-b2b0-0eb71021931c&v=&b=&from_search=1"
URL="https://www.slideshare.net/tusharpanda88/python-basics-59573634?qid=03cb80ee-36f0-4241-a516-454ad64808a8&v=&b=&from_search=5"
r = requests.get(URL_LESS)

soup = BeautifulSoup(r.content, "html5lib")

imgs = soup.find_all('img', class_="slide-image")
imgSRC = [x.get("srcset").split(',')[0].strip().split(' ')[0].split('?')[0] for x in imgs]

imagesJPG = []
for img in imgSRC:
    im = requests.get(img)
    f = io.BytesIO(im.content)
    imgJPG = Image.open(f)
    imagesJPG.append(imgJPG)

imagesJPG[0].save(f"{soup.title.string}.pdf",save_all=True, append_images=imagesJPG[1:])

Try changing URL_LESS to URL in the requests.get call and you will see what I mean.

Here is the traceback:

Traceback (most recent call last):
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\connection.py", line 95, in create_connection
    raise err
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 358, in connect
    conn = self._new_conn()
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\adapters.py", line 440, in send
    resp = conn.urlopen(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\urllib3\util\retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='image.slidesharecdn.com', port=443): Max retries exceeded with url: /pythonbasics-160315100530/85/python-basics-8-320.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\Work\py\scrapingScripts\slideshare\main.py", line 16, in <module>
    im = requests.get(img)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "D:\Work\py\scrapingScripts\tkinter\env\lib\site-packages\requests\adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='image.slidesharecdn.com', port=443): Max retries exceeded with url: /pythonbasics-160315100530/85/python-basics-8-320.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000259643FF820>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

Answer 1

The script works fine for me with both URL and URL_LESS, so your internet connection is probably the culprit here.

My guesses are:

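You are on a slow or inconsistent internet connection.
SlideShare is blacklisting your IP or user agent, possibly as DDoS protection. (Unlikely)
You are using IPv6, which has been the culprit for me in cases like this. Try switching your network to IPv4 only.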
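If you want to test the IPv6 guess without reconfiguring your network, one option is to make urllib3 (which requests uses under the hood) resolve hostnames over IPv4 only. This is a workaround sketch, not an official requests option: it overrides allowed_gai_family in urllib3.util.connection, which is an internal helper and may change in a future urllib3 release. The function name force_ipv4 is my own.

```python
import socket

import urllib3.util.connection as urllib3_connection


def force_ipv4():
    """Make urllib3 (and therefore requests) resolve hostnames to IPv4 only."""
    # urllib3 calls allowed_gai_family() to choose the address family it
    # passes to socket.getaddrinfo(); returning AF_INET limits lookups to IPv4.
    # Assumption: this is an internal urllib3 helper, not a public setting.
    urllib3_connection.allowed_gai_family = lambda: socket.AF_INET


# Call once, before the first requests.get(...) in the script.
force_ipv4()
```

If the long presentation downloads fine after this call, IPv6 routing on your connection is the likely cause.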

As for requests itself, I have personally used it for quite a while to scrape fairly large amounts of data, so I can say it is a great library.
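So rather than replacing requests, I would make it more tolerant of a flaky connection. Below is a minimal sketch using a Session with automatic retries, an explicit timeout, and a browser-like User-Agent. The helper names make_session and fetch_bytes, the retry counts, the timeout values, and the User-Agent string are all my own assumptions, so tune them for your connection.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5, backoff=1.0):
    """Build a Session that retries failed requests with exponential backoff."""
    retry = Retry(
        total=total_retries,
        connect=total_retries,   # retry connection failures like WinError 10060
        read=total_retries,      # retry reads that time out mid-download
        backoff_factor=backoff,  # wait progressively longer between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Hypothetical User-Agent value; some sites reject the library default.
    session.headers["User-Agent"] = "Mozilla/5.0 (slides-to-pdf script)"
    return session


def fetch_bytes(session, url, timeout=(10, 30)):
    """Download url and return its body; timeout is (connect, read) seconds."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of saving an error page
    return response.content
```

In the loop from the question, create session = make_session() once before the loop and replace requests.get(img).content with fetch_bytes(session, img). Reusing one Session also keeps the connection to image.slidesharecdn.com open, so each slide does not need a fresh connection attempt.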

Tags: python, automation, beautifulsoup, web-scraping, python-3.x