Scrape booking.com with Python against AJAX requests

Question

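I am trying to scrape data from booking.com. Almost everything works now, but I cannot get the prices. From what I have read so far, this is because the prices are loaded via AJAX calls. Here is my code: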

import requests
import re

from bs4 import BeautifulSoup

url = "http://www.booking.com/searchresults.pl.html"
payload = {
'ss':'Warszawa', 
'si':'ai,co,ci,re,di',
'dest_type':'city',
'dest_id':'-534433',
'checkin_monthday':'25',
'checkin_year_month':'2015-10',
'checkout_monthday':'26',
'checkout_year_month':'2015-10',
'sb_travel_purpose':'leisure',
'src':'index',
'nflt':'',
'ss_raw':'',
'dcid':'4'
}

r = requests.post(url, payload)
html = r.content
parsed_html = BeautifulSoup(html, "html.parser")

print parsed_html.head.find('title').text

tables = parsed_html.find_all("table", {"class" : "sr_item_legacy"})

print "Found %s records." % len(tables)

with open("requests_results.html", "w") as f:
    f.write(r.content)

for table in tables:
    name = table.find("a", {"class" : "hotel_name_link url"})
    average = table.find("span", {"class" : "average"})
    price = table.find("strong", {"class" : re.compile(r".*\bprice scarcity_color\b.*")})
    print name.text + " " + average.text + " " + price.text

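Using Developer Tools in Chrome, I noticed that the web page sends a raw response containing all the data, including the prices. When I inspect the response content in one of the tabs, the raw values with prices are there. So why can't I retrieve them with my script, and how can I fix it?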

Answer 1

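The first problem is that the site's HTML is malformed: a div is opened inside your table, and an em is closed instead. As a result, html.parser cannot find the strong tags that contain the prices. You can fix this by installing and using lxml: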

parsed_html = BeautifulSoup(html, "lxml")
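If lxml may not be installed everywhere the script runs, the parser name can be chosen at runtime. The following is only a sketch: `pick_parser` is a hypothetical helper, not part of BeautifulSoup, and it is written for Python 3 (the question's code uses Python 2 print statements). Per the explanation above, falling back to html.parser would bring back the missing prices, so the fallback is mainly useful for failing softly. lxml itself can be installed with `pip install lxml`.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return the preferred parser name if its package is importable.

    Hypothetical helper: it only checks whether a top-level package with
    the given name can be found; it does not import or validate it.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


parser_name = pick_parser()
print("Using parser:", parser_name)

# With the question's variables (requires bs4 to be installed):
# parsed_html = BeautifulSoup(html, parser_name)
```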

The second problem is in your regular expression. It does not match anything. Change it to the following:

price = table.find("strong", {"class" : re.compile(r".*\bscarcity_color\b.*")})
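To see how the two patterns differ, they can be tried on plain strings with the standard `re` module. This is a minimal sketch: the class attribute values below are made up for illustration and are not taken from the real page, so it only shows what each pattern accepts, not what booking.com actually serves. The original pattern requires `price` and `scarcity_color` to appear next to each other in that order, while the relaxed one only needs `scarcity_color` as a whole word.

```python
import re

# Pattern from the question and the relaxed pattern suggested above.
original_pattern = re.compile(r".*\bprice scarcity_color\b.*")
relaxed_pattern = re.compile(r".*\bscarcity_color\b.*")

# Hypothetical class attribute values, for illustration only.
samples = [
    "price scarcity_color",
    "scarcity_color price",
    "price big scarcity_color",
    "average",
]

# Map each sample to (matches original, matches relaxed).
results = {
    value: (
        bool(original_pattern.search(value)),
        bool(relaxed_pattern.search(value)),
    )
    for value in samples
}

for value, (old_hit, new_hit) in results.items():
    print(f"{value!r}: original={old_hit}, relaxed={new_hit}")
```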

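Now you will find the prices. However, some entries do not contain a price, so your print statement will raise an error. To fix this, you can change your print to the following: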

print name.text, average.text, price.text if price else 'No price found'
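The same guard can be factored into a small function if several fields might be missing. A sketch in Python 3, assuming only that a failed lookup yields None and a successful one yields an object with a `.text` attribute; `text_or_default` is a hypothetical name, and `SimpleNamespace` merely stands in for a parsed tag so the example runs without bs4.

```python
from types import SimpleNamespace


def text_or_default(node, default="No price found"):
    """Return node.text, or the default when the lookup returned None."""
    return node.text if node is not None else default


# Stand-ins for what table.find(...) might return (hypothetical values).
found = SimpleNamespace(text="250 PLN")
missing = None

print(text_or_default(found))
print(text_or_default(missing))
```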

Note that in Python you can separate the fields to print with a comma (,), so you do not need to concatenate them with + " " +.
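A quick check of that equivalence, written for Python 3 where print is a function (the question's code uses the Python 2 print statement). The field values are hypothetical.

```python
import io

# Hypothetical field values standing in for name.text, average.text, price.text.
name, average, price = "Hotel Example", "8.4", "250 PLN"

# Comma-separated arguments: print inserts a single space between them.
buffer = io.StringIO()
print(name, average, price, file=buffer)

# Manual concatenation, as in the question's code.
joined = name + " " + average + " " + price

same_output = buffer.getvalue() == joined + "\n"
print(same_output)
```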

