Why can a Selenium webdriver open a URL that the standard Python urlopen function cannot?
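I ran into a URL that I cannot open with urllib.request.urlopen from the Python 3.8 standard library. Luckily, I happened to be experimenting with Selenium and found that selenium.webdriver.Chrome can open the same URL. I would like to understand why.
Here is a minimal example: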
from urllib.request import urlopen, HTTPError
from selenium import webdriver

urls = ("https://yahoo.com",
        "https://finance.yahoo.com/quote/IWM?p=IWM",
        "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800")
for url in urls:
    print(f"\nopening {url}:")
    try:
        with urlopen(url) as f:
            lines = f.readlines()
            n = len(lines)
            print(f"retrieved {n} lines.")
    except HTTPError as e:
        print(e)
        print(f"\nretrying {url} with Selenium webdriver:")
        options = webdriver.chrome.options.Options()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        lines = driver.page_source.split("\n")
        n = len(lines)
        print(f"retrieved {n} lines.")
        driver.close()
This is its output:
opening https://yahoo.com:
retrieved 1805 lines.
opening https://finance.yahoo.com/quote/IWM?p=IWM:
retrieved 655 lines.
opening https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800:
HTTP Error 404: Not Found
retrying https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800 with Selenium webdriver:
retrieved 572 lines.
Some websites restrict access based on the user agent. You can try supplying a user agent with your request:
from urllib.request import urlopen, HTTPError, Request

req = Request(
    "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800",
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
try:
    with urlopen(req) as f:
        lines = f.readlines()
        n = len(lines)
        print(f"retrieved {n} lines.")
except HTTPError as e:
    print(e)
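To see why the user agent may matter here, it can help to check what urllib sends when you do not set one. The sketch below makes no network calls. It reads the default header from a freshly built opener and builds a Request with an explicit one. The names default_user_agent, make_request and BROWSER_UA are hypothetical helpers made up for this illustration, and whether this particular site actually filters on the header is an assumption, not something confirmed above.

```python
from urllib.request import Request, build_opener

# Hypothetical browser-like value, used only for illustration.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


def default_user_agent() -> str:
    """Return the User-Agent header urllib sends when none is supplied."""
    opener = build_opener()
    # addheaders is a list of (name, value) pairs added to every request
    # made through this opener; by default it holds only the User-agent.
    return dict(opener.addheaders)["User-agent"]


def make_request(url: str, user_agent: str = BROWSER_UA) -> Request:
    """Build a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Prints something like "Python-urllib/3.8", which a server can
    # easily tell apart from a real browser.
    print(default_user_agent())
    req = make_request("https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800")
    # Request stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

Chrome driven by Selenium sends its own browser user agent (and also executes JavaScript), so a server that rejects the Python-urllib value could still serve the page to it.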