Beautifulsoup 无法读取页面
Beautifulsoup fail to read page
我正在尝试以下操作:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup
上面的print
语句显示如下:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
body { text-align: center; padding: 150px; }
h1 { font-size: 50px; }
body { font: 20px Helvetica, sans-serif; color: #333; }
#article { display: block; text-align: left; width: 650px; margin: 0 auto; }
a { color: #dc8100; text-decoration: none; }
a:hover { color: #333; text-decoration: none; }
</style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>
我可以通过同一台计算机上的浏览器访问 url,因此服务器绝对不会阻止我的 IP。我不明白我的代码有什么问题?
您需要先获取一些cookie,然后才能访问url。
虽然这可以用 urllib2
和 CookieJar
来完成,但我建议 requests
:
import requests
from BeautifulSoup import BeautifulSoup
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
ses = requests.Session()
ses.get(url1)
soup = BeautifulSoup(ses.get(url).content)
print soup.prettify()
请注意,requests
不是标准库,您必须安装它。
如果你想使用 urllib2
:
import urllib2
from cookielib import CookieJar
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open(url1)
soup = BeautifulSoup(opener.open(url).read())
print soup.prettify()
我正在尝试以下操作:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup
上面的print
语句显示如下:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
body { text-align: center; padding: 150px; }
h1 { font-size: 50px; }
body { font: 20px Helvetica, sans-serif; color: #333; }
#article { display: block; text-align: left; width: 650px; margin: 0 auto; }
a { color: #dc8100; text-decoration: none; }
a:hover { color: #333; text-decoration: none; }
</style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>
我可以通过同一台计算机上的浏览器访问 url,因此服务器绝对不会阻止我的 IP。我不明白我的代码有什么问题?
您需要先获取一些cookie,然后才能访问url。
虽然这可以用 urllib2
和 CookieJar
来完成,但我建议 requests
:
import requests
from BeautifulSoup import BeautifulSoup
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
ses = requests.Session()
ses.get(url1)
soup = BeautifulSoup(ses.get(url).content)
print soup.prettify()
请注意,requests
不是标准库,您必须安装它。
如果你想使用 urllib2
:
import urllib2
from cookielib import CookieJar
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open(url1)
soup = BeautifulSoup(opener.open(url).read())
print soup.prettify()