For loop trying to scrape TripAdvisor restaurant data

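I am trying to scrape a list of all restaurants in Hong Kong along with their respective URLs. Currently, with my code below, I can scrape the first and second pages. However, I would like the for loop at the bottom to be more dynamic and keep scraping until it reaches the number of entries I specify in range().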

I'm still new at this, so any help would be great.

#import libraries
import requests
from bs4 import BeautifulSoup
import csv


#scrape the first page separately because its URL is different from the URLs of the following pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string

#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break

I ended up adding a while loop to get it to loop the way I wanted. Hope this helps someone in the future.

for i in range(30, 120, 30):
    #note: this compares an int with the built-in range function; in CPython 2 that is
    #always true (Python 3 raises TypeError), and the break at the end leaves the while
    #after one pass, so each offset is requested exactly once
    while i <= range:
        i = str(i)
        #url format offsets the restaurants in increments of 30 after the oa; hence i as the offset
        url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + i + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
        r1 = requests.get(url1)
        data1 = r1.text
        soup1 = BeautifulSoup(data1, "html.parser")
        for link in soup1.findAll('a', {'property_title'}):
            print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
            print link.string
        break
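
For anyone reading this later, here is a minimal sketch of the same idea in Python 3 syntax, without the while loop: the page URL is derived from the loop variable instead of a constant. The names `page_urls`, `BASE`, and `SUFFIX` are hypothetical helpers introduced only for illustration. The URL pattern is the one used in the code above, and it assumes the first page is the URL without the `-oa` segment, as in the question.

```python
# Sketch only: builds the listing-page URLs, it does not fetch anything.
BASE = 'https://www.tripadvisor.com/Restaurants-g294217'
SUFFIX = '-Hong_Kong.html#EATERY_LIST_CONTENTS'


def page_urls(total_entries, per_page=30):
    """Return the listing-page URLs covering the first total_entries restaurants."""
    urls = []
    for offset in range(0, total_entries, per_page):
        if offset == 0:
            # First page: no '-oa<offset>' segment in the URL.
            urls.append(BASE + SUFFIX)
        else:
            # Later pages: the offset after 'oa' grows by per_page each time.
            urls.append(BASE + '-oa' + str(offset) + SUFFIX)
    return urls


if __name__ == '__main__':
    for url in page_urls(120):
        print(url)
```

Each URL can then be passed to requests.get and parsed with BeautifulSoup exactly as in the loop body above. Changing the argument to `page_urls` changes how many pages are covered.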