Python scraping: Error 54 'Connection reset by peer'
I wrote a simple script to fetch HTML from several websites. It had worked fine until yesterday, when it suddenly started throwing the exception below.
Traceback (most recent call last):
File "crowling.py", line 45, in <module>
result = requests.get(url)
File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/sessions.py", line 685, in send
r.content
File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/models.py", line 829, in content
self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/models.py", line 754, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))
The main part of the script looks like this.
c = 0
# urls is the list of urls as strings
for url in urls:
    result = requests.get(url)
    c += 1
    with open('htmls/p{}.html'.format(c), 'w', encoding='UTF-8') as f:
        f.write(result.text)
The list of URLs is generated by other code of mine, and I have checked that they are correct. The timing of the exception is not constant either: sometimes it stops while fetching the 20th HTML file, sometimes it keeps going until around the 80th before stopping. Since the exception appeared suddenly without any code changes, I suspect it is caused by the Internet connection. Still, I want the script to run reliably. What are the possible causes of this error?
If you are sure the URLs are correct and the problem is an intermittent connection issue, you can retry after a failure:
import time
import requests
from requests.exceptions import ChunkedEncodingError

c = 0
# urls is the list of urls as strings
for url in urls:
    trycnt = 3  # max number of attempts per URL
    while trycnt > 0:
        try:
            result = requests.get(url)
            c += 1
            with open('htmls/p{}.html'.format(c), 'w', encoding='UTF-8') as f:
                f.write(result.text)
            break  # success, go to next URL
        except ChunkedEncodingError as ex:
            trycnt -= 1
            if trycnt <= 0:
                print("Failed to retrieve: " + url + "\n" + str(ex))  # done retrying
            else:
                time.sleep(0.5)  # wait half a second, then retry
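Alternatively, requests can retry at the transport level through urllib3's Retry helper mounted on a Session with an HTTPAdapter. Below is a minimal sketch of that approach, assuming the same urls list; the retry count and backoff_factor are arbitrary choices. Note that a reset arriving mid-download can still surface as ChunkedEncodingError when the body is read, so you may want to keep a try/except around the loop body as well.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,             # up to 3 retries per request
    connect=3,           # retry failed connection attempts
    read=3,              # retry failed reads
    backoff_factor=0.5,  # exponential backoff between attempts
)
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

for c, url in enumerate(urls, start=1):
    result = session.get(url)
    with open('htmls/p{}.html'.format(c), 'w', encoding='UTF-8') as f:
        f.write(result.text)

Reusing a single Session also keeps connections alive between requests, which tends to reduce the number of fresh TCP handshakes the servers see.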