Why is urlopen not working for certain websites?
I'm new to Python, and I'm trying to scrape some basic data from a client's website. I've used this exact same approach on other websites and gotten the expected results. This is what I have so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
main_url = 'https://www.grainger.com/category/pipe-hose-tube-fittings/hose-products/hose-fittings-couplings/cam-groove-fittings-gaskets/metal-cam-groove-fittings/stainless-steel-cam-groove-fittings'
uClient = urlopen(main_url)
main_html = uClient.read()
uClient.close()
Even this simple call to read the website results in what looks like a timeout error. As I said, I've used this exact same code successfully on other websites. The error is:
Traceback (most recent call last):
File "Pricing_Tool.py", line 6, in <module>
uClient = uReq(main_url)
File "C:\Users\Brian Knoll\anaconda3\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Brian Knoll\anaconda3\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\Brian Knoll\anaconda3\lib\urllib\request.py", line 543, in _open
'_open', req)
File "C:\Users\Brian Knoll\anaconda3\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\Brian Knoll\anaconda3\lib\urllib\request.py", line 1362, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Users\Brian Knoll\anaconda3\lib\urllib\request.py", line 1322, in do_open
r = h.getresponse()
File "C:\Users\Brian Knoll\anaconda3\lib\http\client.py", line 1344, in getresponse
response.begin()
File "C:\Users\Brian Knoll\anaconda3\lib\http\client.py", line 306, in begin
version, status, reason = self._read_status()
File "C:\Users\Brian Knoll\anaconda3\lib\http\client.py", line 267, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Users\Brian Knoll\anaconda3\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "C:\Users\Brian Knoll\anaconda3\lib\ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "C:\Users\Brian Knoll\anaconda3\lib\ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Is it possible that this website is just too large to handle?
Any help would be greatly appreciated. Thanks!
Usually websites return a response when a request is sent through `requests`. But some websites require certain specific headers, such as User-Agent, Cookie, etc., and this is one such website. You need to send a User-Agent so that the website sees the request as coming from a browser. The following code should return a response code of 200.
import requests
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
res = requests.get("https://www.grainger.com/category/pipe-hose-tube-fittings/hose-products/hose-fittings-couplings/cam-groove-fittings-gaskets/metal-cam-groove-fittings/stainless-steel-cam-groove-fittings", headers=headers)
print(res.status_code)
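If you would rather stay with `urllib` from the original question, the same fix can be applied there by attaching the headers to a `Request` object instead of passing a bare URL. This is a minimal sketch under the same assumption, that the site only rejects urllib's default `Python-urllib/3.x` User-Agent:

```python
from urllib.request import Request, urlopen

main_url = 'https://www.grainger.com/category/pipe-hose-tube-fittings/hose-products/hose-fittings-couplings/cam-groove-fittings-gaskets/metal-cam-groove-fittings/stainless-steel-cam-groove-fittings'

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

# Build a Request that carries the browser-like headers; urlopen()
# accepts a Request object in place of a URL string.
req = Request(main_url, headers=headers)

# uClient = urlopen(req)      # would now fetch the page
# main_html = uClient.read()
```

This keeps the rest of the original script unchanged; only the object handed to `urlopen` differs.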
Update:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, "lxml")
print(soup.find_all("a"))
This will give all the anchor tags.
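To pull just the link targets out of those anchor tags, read each tag's `href` attribute. A small self-contained sketch, using a tiny stand-in HTML string (with the real page you would pass `res.text` instead) and the stdlib `html.parser` backend so it runs even without `lxml` installed:

```python
from bs4 import BeautifulSoup

# Stand-in document; substitute res.text for the real page.
html = '<a href="/one">One</a><p>no link here</p><a href="/two">Two</a>'

soup = BeautifulSoup(html, "html.parser")

# .get("href") returns None instead of raising when an <a> has no href.
links = [a.get("href") for a in soup.find_all("a")]
print(links)  # → ['/one', '/two']
```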