Why can a Selenium webdriver open a URL that the standard Python urlopen function cannot?

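I ran into a URL that urllib.request.urlopen from the Python 3.8 standard library cannot open. Luckily, I happened to be experimenting with Selenium and found that selenium.webdriver.Chrome can open the same URL. I would like to understand why.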

Here is a minimal example:

from urllib.request import urlopen
from urllib.error import HTTPError
from selenium import webdriver

urls = ("https://yahoo.com",
        "https://finance.yahoo.com/quote/IWM?p=IWM",
        "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800")

for url in urls:
    print(f"\nopening {url}:")
    try:
        with urlopen(url) as f:
            lines = f.readlines()
        n = len(lines)
        print(f"retrieved {n} lines.")
    except HTTPError as e:
        print(e)

print(f"\nretrying {url} with Selenium webdriver:")
options = webdriver.chrome.options.Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(url)
lines = driver.page_source.split("\n")
n = len(lines)
print(f"retrieved {n} lines.")
driver.quit()  # quit() also shuts down the chromedriver process; close() only closes the window

Here is its output:

opening https://yahoo.com:
retrieved 1805 lines.

opening https://finance.yahoo.com/quote/IWM?p=IWM:
retrieved 655 lines.

opening https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800:
HTTP Error 404: Not Found

retrying https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800 with Selenium webdriver:
retrieved 572 lines.
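
The one-line message printed above hides most of what the server sent back. As a rough sketch, an HTTPError can be unpacked to show the status code, the reason, and the response headers. `describe_http_error` is a hypothetical helper name, not part of urllib, and the error at the bottom is a stand-in built by hand so that the snippet runs without network access.

```python
from email.message import Message
from urllib.error import HTTPError


def describe_http_error(err):
    """Return a multi-line summary of an HTTPError (hypothetical helper)."""
    lines = [f"status: {err.code}", f"reason: {err.reason}"]
    # err.headers holds the response headers the server sent with the error.
    for name, value in err.headers.items():
        lines.append(f"header: {name}: {value}")
    return "\n".join(lines)


# Stand-in error built by hand; a real one would come from `except HTTPError as e`.
fake_headers = Message()
fake_headers["Content-Type"] = "text/html"
fake_error = HTTPError("https://example.invalid/", 404, "Not Found", fake_headers, None)

summary = describe_http_error(fake_error)
print(summary)
```

In the loop above, calling `print(describe_http_error(e))` inside the `except HTTPError as e` branch would show the same fields for the real response.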

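Some websites restrict access based on the user agent. By default, urlopen identifies itself with a User-Agent of the form Python-urllib/3.8, whereas Chrome driven by Selenium sends a browser-style User-Agent. You can try supplying a user agent with your request: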

from urllib.request import urlopen, Request
from urllib.error import HTTPError

req = Request(
    "https://finance.yahoo.com/quote/IWM/options?p=IWM&straddle=false&date=1640908800", 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

try:
    with urlopen(req) as f:
        lines = f.readlines()
    n = len(lines)
    print(f"retrieved {n} lines.")
except HTTPError as e:
    print(e)
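
To see what urllib sends by default, and to avoid building a Request for every call, one option is to inspect and replace the opener's default headers. The following is a minimal sketch: `make_opener_with_user_agent` is a hypothetical helper name, the User-Agent string is the one from the example above, and whether a given site accepts it is an assumption that may not hold.

```python
import urllib.request

# User-Agent string copied from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
)

# A freshly built opener carries the headers urllib sends when none are given.
default_opener = urllib.request.build_opener()
default_ua = dict(default_opener.addheaders)["User-agent"]
print(default_ua)  # has the form Python-urllib/<major>.<minor>


def make_opener_with_user_agent(user_agent):
    """Build an opener whose default User-Agent is `user_agent` (hypothetical helper)."""
    opener = urllib.request.build_opener()
    # addheaders is applied to every request that does not set the header itself.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener_with_user_agent(BROWSER_UA)
```

Calling `urllib.request.install_opener(opener)` afterwards makes later plain `urlopen(url)` calls in the same process send this header; `opener.open(url)` uses it for a single call without changing global state.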