For loop web scraping a website brings up TimeoutError, NewConnectionError and requests.exceptions.ConnectionError
Sorry, I'm a beginner at Python and web scraping.
I'm scraping wugniu.com to extract the readings of characters I input. I made a list of 10,273 characters to format into URLs and bring up the pages with the readings. I then use the Requests module to return the source code and Beautiful Soup to return all the audio IDs (their strings contain the readings of the input character; I can't use the text that appears in the table because it's rendered as SVGs). Then I try to output each character and its readings to out.txt.
# -*- coding: utf-8 -*-
import requests, time
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
characters = [
#characters go here
]
output = open("out.txt", "a", encoding="utf-8")
tic = time.perf_counter()
for char in characters:
    # Characters from the list are formatted into the url
    url = "https://wugniu.com/search?char=%s&table=wenzhou" % char
    page = requests.get(url, verify=False)
    soup = BeautifulSoup(page.text, 'html.parser')
    for audio_tag in soup.find_all('audio'):
        audio_id = audio_tag.get('id').replace("0-","")
        #output.write(char)
        #output.write(" ")
        #output.write(audio_id)
        #output.write("\n")
    print(char)
    time.sleep(60)
output.close()
toc = time.perf_counter()
duration = int(toc) - int(tic)
print("Took %d seconds" % duration)
out.txt is the file I'm trying to write the results to. I also timed the process to measure performance.
However, after around 50 loops, I get this in cmd:
Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 96, in create_connection
    raise err
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 353, in connect
    conn = self._new_conn()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\test.py", line 3282, in <module>
    page = requests.get(url, verify=False)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
I tried to fix this by adding time.sleep(60), but the error still occurs. When I wrote this script yesterday, I was able to run it with a list of up to 1500 characters without any errors. Could someone help me fix this? Thanks.
This is completely normal, expected behavior, because it's a chicken-and-egg kind of problem.
Imagine you open the Firefox browser, load google.com, close it, and then repeat that cycle!
That counts as a DDoS attack, and all modern servers will block your requests and flag your IP, because it really hurts their bandwidth!
The logical and correct approach is to use the same session instead of creating multiple sessions over and over, since a reused session won't show up under a TCP SYN flood flag. Check the legal TCP flags.
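To make that concrete, here is a minimal sketch of a single shared Session that also retries transient connection failures with a backoff (the retry count and backoff factor are illustrative assumptions, not tuned values):

import requests
import urllib3
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

urllib3.disable_warnings()  # silence the InsecureRequestWarning from verify=False

session = requests.Session()
session.verify = False  # matches the original script; disables TLS verification

# Illustrative policy: retry up to 3 times with exponential backoff on
# connection errors and common throttling/server-error status codes.
retry = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retry))

r = session.get("https://wugniu.com/search?char=核&table=wenzhou")
print(r.status_code)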
On the other hand, you really should use a context manager instead of having to remember to close your file handle yourself.
Example:
output = open("out.txt", "a", encoding="utf-8")
output.close()
can be handled with a with statement like this:
with open('out.txt', 'w', newline='', encoding='utf-8') as output:
# here you can do your operation.
and once you leave the with block, your file will be closed automatically!
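Applied to your script, that could look like this (a runnable sketch with a one-character sample list standing in for your full list):

characters = ["核"]  # sample list; your real list goes here

# The file handle is closed automatically when the block exits,
# even if an exception is raised inside it.
with open("out.txt", "a", encoding="utf-8") as output:
    for char in characters:
        output.write(char + "\n")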
Also, consider using the newer format-string style instead of the old one:
url = "https://wugniu.com/search?char=%s&table=wenzhou" % char
can become:
"https://wugniu.com/search?char={}&table=wenzhou".format(char)
I won't use professional-grade code here; I've simplified it so you can understand the concept.
Pay attention to how I picked up the desired element and how I wrote it to the file. The speed difference between lxml and html.parser can be found here.
import requests
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings()
def main(url, chars):
    with open('result.txt', 'w', newline='', encoding='utf-8') as f, requests.Session() as req:
        req.verify = False
        for char in chars:
            print(f"Extracting {char}")
            r = req.get(url.format(char))
            soup = BeautifulSoup(r.text, 'lxml')
            target = [x['id'][2:] for x in soup.select('audio[id^="0-"]')]
            print(target)
            f.write(f'{char}\n{str(target)}\n')


if __name__ == "__main__":
    chars = ['核']
    main('https://wugniu.com/search?char={}&table=wenzhou', chars)
Also, following the Python DRY principle, set req.verify = False once on the session instead of passing verify=False with every request.
Next step: you should take a look at threading or async programming to improve your script's run time; in real-world projects we don't use a plain for loop (which counts as really slow) when you can send a bunch of URLs and wait for the responses.
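For example, here is a minimal threaded sketch using concurrent.futures (the worker count and the fetch helper name are illustrative choices, not part of the code above):

import requests
import urllib3
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

urllib3.disable_warnings()

URL = "https://wugniu.com/search?char={}&table=wenzhou"

def fetch(session, char):
    # Reuse the shared session and parse out the audio IDs as before.
    r = session.get(URL.format(char))
    soup = BeautifulSoup(r.text, "lxml")
    return char, [x["id"][2:] for x in soup.select('audio[id^="0-"]')]

with requests.Session() as session:
    session.verify = False
    with ThreadPoolExecutor(max_workers=5) as executor:
        # map() fans the characters out across worker threads.
        for char, ids in executor.map(lambda c: fetch(session, c), ["核", "致"]):
            print(char, ids)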