http.client.IncompleteRead error in Python3

Question

I'm trying to scrape a really long web page with beautifulsoup4 and Python 3. Because of the size of the page, http.client throws an error when I try to search for something in the site:

  File "/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/anaconda3/lib/python3.6/http/client.py", line 570, in _readall_chunked
    raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(16109 bytes read)

Is there any way to get around this error?

Answer 1

As the documentation for http.client tells you right at the top, this is a very low-level library, meant primarily to support urllib, and:

See also The Requests package is recommended for a higher-level HTTP client interface.

If you can conda install requests or pip install requests, your problem becomes trivial:

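import requests
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

# Fetch the whole page; requests handles the chunked transfer for you.
req = requests.get('https://www.worldcubeassociation.org/results/events.php?eventId=222&regionId=&years=&show=All%2BPersons&average=Average')
soup = BeautifulSoup(req.text, 'lxml')  # the 'lxml' parser must be installed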

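If you can't install a third-party library, working around this is possible, but it isn't actually supported, and it isn't easy. None of the chunk-handling code in http.client is public or documented, but the documentation does link you to the source, where you can see the private methods. In particular, notice that read calls a method named _readall_chunked, which loops, calling _get_chunk_left and then _safe_read on each chunk. That _safe_read method is the code you would need to replace (for example, by subclassing HTTPResponse, or by monkeypatching it) to work around this problem. That is probably not going to be nearly as simple or fun as just using a higher-level library.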
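For a standard-library-only setup, a simpler route than replacing the private methods described above is to catch the exception and keep whatever was received. This is a different technique from the one in this answer, shown only as a minimal sketch: it relies on the documented partial attribute of http.client.IncompleteRead, and the helper names read_allowing_partial and fetch are hypothetical. The returned body may be truncated, so anything parsed from it can be missing content.

```python
import http.client
import urllib.request


def read_allowing_partial(response):
    """Read a response body, tolerating a truncated chunked transfer.

    Returns (body, complete). `complete` is False when the server closed
    the connection early and only part of the body was received.
    """
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes that were read before the failure.
        return exc.partial, False


def fetch(url):
    """Hypothetical helper: open `url` and read it with the tolerant reader."""
    with urllib.request.urlopen(url) as response:
        return read_allowing_partial(response)
```

A call such as body, complete = fetch(url) could then pass body to BeautifulSoup, checking complete first to decide whether a retry is worthwhile.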

