BeautifulSoup fails to read page

Question

I am trying the following:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup

The print statement above shows the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>

I can open the URL in a browser on the same computer, so the server is definitely not blocking my IP. What is wrong with my code?

Answer 1

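You need to obtain some cookies before you can access the URL. The "Click here to reload the property search" link in the error page points to http://propaccess.traviscad.org/clientdb/?cid=1, so request that page first and reuse the resulting cookies for the property page.
Although this can be done with urllib2 and CookieJar, I recommend requests: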

import requests
from BeautifulSoup import BeautifulSoup

url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
ses = requests.Session()
ses.get(url1)
soup = BeautifulSoup(ses.get(url).content)
print soup.prettify()

Note that requests is not part of the standard library, so you have to install it. If you would rather use urllib2:

import urllib2
from cookielib import CookieJar
from BeautifulSoup import BeautifulSoup

url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open(url1)
soup = BeautifulSoup(opener.open(url).read())
print soup.prettify()
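
To illustrate why the first request matters, here is a minimal Python 3 sketch (in Python 3, urllib2 became urllib.request and cookielib became http.cookiejar). It does not contact the real site. Instead it starts a small local server, which is purely a hypothetical stand-in: the handler class, the /landing and /data paths, and the session cookie value are all made up for the demonstration, and the real server's behaviour is assumed rather than known to match.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SessionHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: /landing sets a cookie, other pages require it."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/landing":
            body = b"welcome"
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        elif "session=abc123" in self.headers.get("Cookie", ""):
            body = b"property details"
        else:
            # Mimics the "Please try again" page from the question.
            body = b"Please try again"
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_without_and_with_cookies():
    # Port 0 lets the OS pick a free port for the local demo server.
    server = HTTPServer(("127.0.0.1", 0), SessionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    # ProxyHandler({}) ignores any proxy set in the environment,
    # so the local server is reached directly.
    no_proxy = urllib.request.ProxyHandler({})
    try:
        # Plain request, no cookie handling: the server rejects it.
        plain = urllib.request.build_opener(no_proxy)
        without = plain.open(base + "/data").read()

        # Opener that stores cookies from responses and sends them back.
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar)
        )
        opener.open(base + "/landing").read()  # picks up the session cookie
        with_cookie = opener.open(base + "/data").read()
    finally:
        server.shutdown()
        server.server_close()
    return without, with_cookie


if __name__ == "__main__":
    print(fetch_without_and_with_cookies())
```

The same opener has to be used for both requests, because the cookie jar attached to it is what carries the session from the first response to the second request.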

beautifulsoup

urlopen

python-2.7