Scraping: scraping multiple pages in a loop (BeautifulSoup)
I am trying to scrape real-estate data with BeautifulSoup, but when I save the results to a .csv file it only contains the information from the first page. I want to scrape the number of pages that I set in the "pages_number" variable.
# How many pages
pages_number = int(input('How many pages? '))
# start the execution timer
tic = time.time()
# Chromedriver
chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# initial link
link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1'
driver.get(link)
# loop over the pages
for page in range(1, pages_number + 1):
    time.sleep(15)
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
I have already tried this solution, but it throws an error:
link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={}.format(page)'
Does anyone know what can be done?

Full code:
https://github.com/arturlunardi/webscraping_vivareal/blob/main/scrap_vivareal.ipynb
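As an aside, the attempted fix above fails because `.format(page)` sits inside the string literal, so it is never executed as a method call; the placeholder and the text ".format(page)" both stay as literal characters in the URL. A minimal illustration of the difference (using a stand-in `example.com` URL, not the real site):

```python
page = 2

# Wrong: .format(page) is part of the string literal, never called as a method.
wrong = 'https://example.com/?page={}.format(page)'

# Right: call .format() on the string, outside the closing quote...
right = 'https://example.com/?page={}'.format(page)

# ...or use an f-string, which interpolates the variable directly.
also_right = f'https://example.com/?page={page}'

print(wrong)       # https://example.com/?page={}.format(page)
print(right)       # https://example.com/?page=2
print(also_right)  # https://example.com/?page=2
```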
I see that the URL you are using belongs to page 1 only:
https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1
Do you change it anywhere in your code? If not, then no matter how many times you fetch, you will only ever fetch page 1.

You should do something like this:
for page in range(1, pages_number + 1):
    chromedriver = "./chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)
    # build the link for the current page
    link = f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}"
    driver.get(link)
    time.sleep(15)
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
    driver.close()
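The loop above launches and closes a fresh Chrome instance for every page, which works but is slow. A rough alternative sketch (same site, same fixed 15-second wait; the helper names `BASE`, `page_urls`, and `scrape` are my own) reuses one driver and builds the URL list up front:

```python
import time

BASE = "https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={}"

def page_urls(pages_number):
    # Build every per-page URL once, up front.
    return [BASE.format(page) for page in range(1, pages_number + 1)]

def scrape(pages_number):
    # Imported here so page_urls() stays usable without selenium/bs4 installed.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    soups = []
    try:
        for url in page_urls(pages_number):
            driver.get(url)
            time.sleep(15)  # crude fixed wait, as in the original code
            soups.append(BeautifulSoup(driver.page_source, "lxml"))
    finally:
        driver.quit()  # always close the browser, even if a page fails
    return soups
```

Reusing one browser avoids the startup cost of Chrome on every iteration; `WebDriverWait` would be a more robust replacement for the fixed `sleep`.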
Test output (without the soup part), with pages_number = 3
(URLs stored in a list for easier viewing):
['https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=2', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=3']
Process finished with exit code 0