How to get more items from a dynamically rendered webpage when web scraping
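I'm using Python to scrape restaurant names from Foodpanda. All of the page's items are rendered through its <script> tags, so I can't get any data through the HTML elements or CSS selectors.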
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

foodpanda_url = "https://www.foodpanda.hk/restaurants/new?lat=22.33523782&lng=114.18249102&expedition=pickup&vertical=restaurants"
# send a request to the page, using the Mozilla 5.0 browser header
req = Request(foodpanda_url, headers={'User-Agent' : 'Mozilla/5.0'})
# open the page using our urlopen library
page = urlopen(req)
soup = BeautifulSoup(page.read(), "html.parser")
print(soup.prettify())
str_soup = str(soup.prettify())
I parse the vendor strings out of str_soup with the following:
import re

fp_vendors = list()
vendorlst = str_soup.split("\"discoMeta\":{\"reco_config\":{\"flags\":[]},\"traces\":[]},\"items\":")
opensqr = 0
startobj = 0
for i in range(len(vendorlst)):
    if i==0:
        continue
    else:
        # find the end of the JSON array by counting square brackets
        for cnt in range(len(vendorlst[i])):
            if (vendorlst[i][cnt] == '['):
                opensqr += 1
            elif (vendorlst[i][cnt] == ']'):
                opensqr -= 1
            if opensqr == 0:
                vendorsStr = vendorlst[i][1:cnt]
                opencurly = 0
                # split the array into objects by counting curly braces
                for x in range(len(vendorsStr)):
                    if vendorsStr[x] == ',':
                        continue
                    if (vendorsStr[x] == '{'):
                        opencurly += 1
                    elif (vendorsStr[x] == '}'):
                        opencurly -= 1
                    if opencurly == 0:
                        vendor = vendorsStr[startobj:x+1]
                        if (vendor not in fp_vendors) and vendor != "":
                            fp_vendors.append(vendor)
                        startobj = x+2 #continue to next {
                        continue
                break
for item in fp_vendors:
    # print(item+"\n")
    itemstr = re.split("\"minimum_pickup_time\":[0-9]+,\"name\":\"", item)[1]
    itemstr = itemstr.split("\",")[0]
    print(itemstr+"\n")
print(len(fp_vendors))
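However, this returns only a small portion of the restaurants, about 50. How can I get my code to fetch more restaurant items from Foodpanda? How can I simulate scrolling down the page so that more items load and I can collect more restaurants?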
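Using your browser's developer tools, you can easily monitor every request the page makes. For your particular case, I found an API call to disco.deliveryhero.io that returns the vendor list as JSON; its full URL is in the code below.
Here is a complete solution to your problem: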
import json
import requests

items_list = []
url = "https://disco.deliveryhero.io/listing/api/v1/pandora/vendors?latitude=22.33523782&longitude=114.18249102&language_id=1&include=characteristics&dynamic_pricing=0&configuration=Variant1&country=hk&customer_id=&customer_hash=&budgets=&cuisine=&sort=&food_characteristic=&use_free_delivery_label=false&opening_type=pickup&vertical=restaurants&limit=48&offset={}&customer_type=regular"
# fetch the first 5 pages, 48 vendors per page
for i in range(5):
    resp = requests.get(
        url.format(i * 48),
        headers={
            "x-disco-client-id": "web",
        },
    )
    if resp.status_code == 200:
        items_list += json.loads(resp.text)["data"]["items"]
        print(f"Finished page: {i}")
print(items_list)
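The loop above stops after a fixed 5 pages. If you want every restaurant, one option is to keep requesting pages until the API returns a short or empty page. Below is a minimal sketch of that idea, not a tested client for this API. The helper names (iter_vendors, make_http_fetcher, vendor_names) are made up for illustration, and two things are assumptions: that a page with fewer than 48 items means the listing is exhausted, and that each item carries a "name" key (suggested by the "name" field the question's regex matches).

```python
from typing import Callable, Dict, Iterable, Iterator, List

PAGE_SIZE = 48  # same value as limit=48 in the URL above


def iter_vendors(fetch_page: Callable[[int], List[Dict]],
                 page_size: int = PAGE_SIZE,
                 max_pages: int = 100) -> Iterator[Dict]:
    """Yield vendors page by page until a short or empty page comes back.

    fetch_page(offset) must return the list of items starting at that offset.
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    for page in range(max_pages):
        items = fetch_page(page * page_size)
        yield from items
        if len(items) < page_size:  # assumption: short page means no more data
            break


def make_http_fetcher(url_template: str,
                      headers: Dict[str, str]) -> Callable[[int], List[Dict]]:
    """Build a fetch_page function from a URL containing an `offset={}` slot."""
    import requests  # imported lazily so the pagination logic has no hard dependency

    def fetch_page(offset: int) -> List[Dict]:
        resp = requests.get(url_template.format(offset), headers=headers)
        resp.raise_for_status()  # fail loudly instead of silently skipping a page
        return resp.json()["data"]["items"]

    return fetch_page


def vendor_names(vendors: Iterable[Dict]) -> List[str]:
    """Return unique names in first-seen order (assumes a "name" key)."""
    return list(dict.fromkeys(v["name"] for v in vendors if "name" in v))
```

With the url string from the code above, usage would look like vendor_names(iter_vendors(make_http_fetcher(url, {"x-disco-client-id": "web"}))). Passing the fetcher in as a function keeps the pagination logic separate from the HTTP call, so it can be tried against fake data first.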