python urllib returns an empty page for specific URLs

Question

I am having trouble with urllib for specific links. Below is the code sample I am using:

from urllib.request import Request, urlopen
import re

url = ""
req = Request(url)
html_page = urlopen(req).read()

print(len(html_page))

Here are the results I get for two links:

url = "https://www.dafont.com"
Length: 0

url = "https://www.whosebug.com"
Length: 196673

Does anyone know why this happens?

Answer 1

Try the following; you will get a response. Some websites are protected and respond only to certain user agents.

from urllib.request import Request, urlopen

url = "https://www.dafont.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
req = Request(url, headers=headers)
html_page = urlopen(req).read()

print(len(html_page))
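
If the same header is needed for several requests, it can be wrapped in a small helper. The following is only a sketch: `BROWSER_UA`, `make_request`, and `fetch` are hypothetical names, and it assumes the site accepts any common browser User-Agent string, which may not hold for every site.

```python
from urllib.request import Request, urlopen

# Assumption: a common browser User-Agent is accepted by the site.
# This string is the one used in the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
)


def make_request(url, user_agent=BROWSER_UA):
    """Build a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=10):
    """Return the response body as bytes; raise if the body is empty."""
    with urlopen(make_request(url, user_agent), timeout=timeout) as resp:
        body = resp.read()
    if not body:
        # An empty body with a 200 status is what the question describes.
        raise ValueError(f"empty body from {url}; the User-Agent may be filtered")
    return body


# Usage (needs network access):
# print(len(fetch("https://www.dafont.com")))
```

Raising on an empty body is a design choice for this sketch; it makes the silent failure from the question visible instead of returning zero bytes.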

Answer 2

This is a restriction imposed by the authors of the dafont website.

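By default, urllib.request sends a User-Agent header of Python-urllib/VVV, where VVV is the Python version number (major.minor). For more information, see: https://docs.python.org/3/library/urllib.request.html

Many site administrators protect themselves against crawlers by inspecting the User-Agent header. When they see a value such as Python-urllib/VVV, they assume the client is a crawler.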
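
The default value can be checked locally without sending a request. This is a small sketch using `build_opener()`, whose `addheaders` list holds the headers urllib adds to every request; the variable names are only illustrative.

```python
from urllib.request import build_opener

# A freshly built opener carries urllib's default headers.
opener = build_opener()
default_headers = dict(opener.addheaders)
default_ua = default_headers["User-agent"]

# Prints something like Python-urllib/3.x, depending on the interpreter.
print(default_ua)
```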

Testing with the HEAD method:

$ curl -A "Python-urllib/2.6" -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:11:53 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Type: text/html

$ curl -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:12:02 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Set-Cookie: PHPSESSID=dcauh0dp1antb7eps1smfg2a76; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Type: text/html

Testing with the GET method:

$ curl -sSL -A "Python-urllib/2.6" https://www.dafont.com | wc -c
       0

$ curl -sSL https://www.dafont.com | wc -c
   18543
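
The same comparison can be reproduced from Python. This is a rough sketch, not a replacement for the curl output above: `build_probe` and `probe` are hypothetical helper names, and the results depend on how the site behaves at the time of the request.

```python
from urllib.request import Request, urlopen


def build_probe(url, user_agent, method="GET"):
    """Build a request with the given User-Agent and HTTP method."""
    return Request(url, headers={"User-Agent": user_agent}, method=method)


def probe(url, user_agent, method="GET", timeout=10):
    """Send one request; return (status code, header pairs, body length)."""
    req = build_probe(url, user_agent, method)
    with urlopen(req, timeout=timeout) as resp:
        # A HEAD response has no body, so the length is 0 in that case.
        return resp.status, resp.getheaders(), len(resp.read())


# Usage (needs network access):
# for ua in ("Python-urllib/2.6", "Mozilla/5.0"):
#     status, headers, size = probe("https://www.dafont.com", ua)
#     print(ua, status, size)
```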
