How to get more items from a dynamically rendered webpage when web scraping

Question

I am using Python to scrape restaurant names from Foodpanda. The items on the page are all rendered through its <script> tags, so I cannot get any data through the HTML/CSS.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

foodpanda_url = "https://www.foodpanda.hk/restaurants/new?lat=22.33523782&lng=114.18249102&expedition=pickup&vertical=restaurants"

# send a request to the page, using the Mozilla 5.0 browser header
req = Request(foodpanda_url, headers={'User-Agent' : 'Mozilla/5.0'})
# open the page using our urlopen library
page = urlopen(req)

soup = BeautifulSoup(page.read(), "html.parser")
print(soup.prettify())

# prettify() already returns a str, so no extra conversion is needed
str_soup = soup.prettify()

I parse the vendor strings out of str_soup with the following approach:

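import re

fp_vendors = list()
# split on the JSON fragment that immediately precedes each vendor array
vendorlst = str_soup.split("\"discoMeta\":{\"reco_config\":{\"flags\":[]},\"traces\":[]},\"items\":")

for i in range(len(vendorlst)):
    if i == 0:
        continue  # the text before the first marker holds no vendor array
    opensqr = 0   # nesting depth of '[' ... ']' in this chunk
    startobj = 0  # start index of the current vendor object, reset per chunk
    for cnt in range(len(vendorlst[i])):
        if vendorlst[i][cnt] == '[':
            opensqr += 1
        elif vendorlst[i][cnt] == ']':
            opensqr -= 1
        if opensqr == 0:
            # the array that opened at index 0 has just closed at index cnt
            vendorsStr = vendorlst[i][1:cnt]
            opencurly = 0
            for x in range(len(vendorsStr)):
                if vendorsStr[x] == ',':
                    continue
                if vendorsStr[x] == '{':
                    opencurly += 1
                elif vendorsStr[x] == '}':
                    opencurly -= 1
                if opencurly == 0:
                    # a top-level object has just closed: slice it out
                    vendor = vendorsStr[startobj:x+1]
                    if (vendor not in fp_vendors) and vendor != "":
                        fp_vendors.append(vendor)
                    startobj = x + 2  # skip the comma, continue at the next {
            break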

for item in fp_vendors:
    # print(item+"\n")
    # keep only the text between '"name":"' and the next '",'
    itemstr = re.split("\"minimum_pickup_time\":[0-9]+,\"name\":\"", item)[1]
    itemstr = itemstr.split("\",")[0]
    print(itemstr+"\n")
print(len(fp_vendors))
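
For comparison, the bracket counting can be handed to the standard library: `json.JSONDecoder.raw_decode` parses one complete JSON value starting at a given index and reports where it ended, so brackets inside strings are not a problem. This is only a minimal sketch, assuming the vendor array directly follows an `"items":` key as in the split marker above. The helper `extract_items` and the sample string are hypothetical and stand in for the real page source.

```python
import json

def extract_items(text, marker='"items":'):
    """Collect the elements of every JSON array that directly follows `marker`."""
    decoder = json.JSONDecoder()
    found = []
    pos = text.find(marker)
    while pos != -1:
        start = pos + len(marker)
        # raw_decode does not skip leading whitespace, so do it by hand
        while start < len(text) and text[start].isspace():
            start += 1
        try:
            # parses one JSON value at `start`, returns (value, end_index)
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            end = start  # no valid JSON here, move on to the next marker
        else:
            if isinstance(value, list):
                found.extend(value)
        pos = text.find(marker, end)
    return found

# Hypothetical sample (not real Foodpanda data)
sample = 'window.__DATA__={"items":[{"name":"Cafe A"},{"name":"Noodle [B]"}]};'
names = [vendor["name"] for vendor in extract_items(sample)]
print(names)
```

Each element comes back as a dict, so the name can be read with `vendor["name"]` instead of a regular expression.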

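However, my code only returns a small subset of the restaurants, about 50. How can I get the code to "fetch" more restaurants from Foodpanda? How do I simulate scrolling down the page so that more items load and I can collect more restaurants?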

Answer 1

Using your browser's developer tools, you can easily monitor all the requests the page makes. For your particular case, I found this API call:

https://disco.deliveryhero.io/listing/api/v1/pandora/vendors?latitude=22.33523782&longitude=114.18249102&language_id=1&include=characteristics&dynamic_pricing=0&configuration=Variant1&country=hk&customer_id=&customer_hash=&budgets=&cuisine=&sort=&food_characteristic=&use_free_delivery_label=false&opening_type=pickup&vertical=restaurants&limit=48&offset=48&customer_type=regular
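
The two parameters that matter for paging are `limit` (the page size) and `offset` (how many vendors to skip). As a rough sketch, the same URL can be rebuilt from a dict so that the offset is easy to change. `build_url` is a hypothetical helper, and it uses only the parameters visible in the captured URL above. Whether the API accepts other values is not something this example verifies.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://disco.deliveryhero.io/listing/api/v1/pandora/vendors"

def build_url(offset, limit=48):
    """Rebuild the captured request URL with a chosen offset and limit."""
    params = {
        "latitude": "22.33523782",
        "longitude": "114.18249102",
        "language_id": 1,
        "include": "characteristics",
        "dynamic_pricing": 0,
        "configuration": "Variant1",
        "country": "hk",
        "customer_id": "",
        "customer_hash": "",
        "budgets": "",
        "cuisine": "",
        "sort": "",
        "food_characteristic": "",
        "use_free_delivery_label": "false",
        "opening_type": "pickup",
        "vertical": "restaurants",
        "limit": limit,
        "offset": offset,
        "customer_type": "regular",
    }
    return BASE + "?" + urlencode(params)

# Second page: skip the first 48 vendors
second_page = build_url(48)
query = parse_qs(urlsplit(second_page).query)
print(query["offset"], query["limit"])
```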

Here is a complete solution to your problem:

import json
import requests

items_list = []
url = "https://disco.deliveryhero.io/listing/api/v1/pandora/vendors?latitude=22.33523782&longitude=114.18249102&language_id=1&include=characteristics&dynamic_pricing=0&configuration=Variant1&country=hk&customer_id=&customer_hash=&budgets=&cuisine=&sort=&food_characteristic=&use_free_delivery_label=false&opening_type=pickup&vertical=restaurants&limit=48&offset={}&customer_type=regular"

for i in range(5):  # pages 0 to 4, 48 vendors per page
    resp = requests.get(
        url.format(i * 48),  # offset = page index * page size
        headers={
            "x-disco-client-id": "web",
        },
    )
    if resp.status_code == 200:
        items_list += json.loads(resp.text)["data"]["items"]
    print(f"Finished page: {i}")

print(items_list)
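
The fixed `range(5)` fetches at most 240 vendors. One possible variant keeps requesting until a page comes back with fewer than `limit` items. To keep the sketch runnable without network access, the HTTP call is passed in as a function. `fetch_all` and `fake_fetch` are hypothetical names, and the stopping rule assumes the API returns a short or empty page once the listing is exhausted. The `"name"` key is taken from the vendor objects shown in the question.

```python
def fetch_all(fetch_page, limit=48, max_pages=100):
    """Call fetch_page(offset) repeatedly until a short or empty page arrives."""
    items = []
    for page in range(max_pages):  # max_pages guards against an endless loop
        batch = fetch_page(page * limit)
        items.extend(batch)
        if len(batch) < limit:
            break  # fewer items than requested: assume this was the last page
    return items

# Stand-in for the real request: pretends the API holds 100 vendors.
def fake_fetch(offset, total=100, limit=48):
    return [{"name": f"Vendor {n}"} for n in range(offset, min(offset + limit, total))]

vendors = fetch_all(fake_fetch)
names = [vendor["name"] for vendor in vendors]
print(len(names))
```

With the code above, the real `fetch_page` could be a small function that calls `requests.get(url.format(offset), headers={"x-disco-client-id": "web"})` and returns `resp.json()["data"]["items"]`.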
