Python - Printing an HTML page gives an empty response for some sites

I want to print the HTML page from a site (whoscored.com). Printing the main page works, but if I try a URL below the main domain, I get an empty response:

import urllib2
htmlfile = urllib2.urlopen("http://whoscored.com/Matches/829663/Live/")
html = htmlfile.read()
print html

First, yes, the page you linked to does not exist.

In addition, you need to send a User-Agent header to receive and see the actual 404 HTML response. An example using the requests library:

>>> import requests
>>> 
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}
>>> response = requests.get("http://whoscored.com/829652/Live/", headers=headers)
>>> print response.content
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" /> 
<meta http-equiv="content-language" content="en" />
<title>WhoScored.com</title>
</head>
<body style="padding: 20px; font-family:Arial,Helvetica,sans-serif; background-color:#222222;">
    <div style="margin:0 auto; padding: 40px 20px; width:560px; background-color:#fff;">
        The page you requested does not exist in <a href="http://www.whoscored.com">WhoScored.com</a>
    </div>
</body>
</html>
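
If you would rather stay with the standard library, the same idea can be sketched roughly as follows. This is written for Python 3, where urllib2 became urllib.request, and it assumes the site still decides what to return based on the User-Agent header. The names USER_AGENT, build_request and fetch are hypothetical helpers made up for this sketch, and the header value is just an example browser string.

```python
import urllib.error
import urllib.request

# Example browser-like User-Agent string (an assumption; any realistic value should do).
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36"
)


def build_request(url):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10):
    """Return (status_code, body_bytes), including the body of error pages."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses, but the error object still
        # exposes the status code and the response body.
        return err.code, err.read()


if __name__ == "__main__":
    status, body = fetch("http://whoscored.com/Matches/829663/Live/")
    print(status)
    print(body.decode("utf-8", errors="replace"))
```

Catching HTTPError matters here because urlopen raises on a 404 instead of returning the page, so without it you would never get to read the "does not exist" HTML.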