Scraping Paginated Pages using Python Beautiful Soup
My scraper is up and running, and it pulls the correct data from all 9 pages of the site. One problem I've run into, though, is that I don't think my current approach is ideal: if the number of pages ever grows beyond the range I've typed in, those results will be missed.
My code is below:
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

houses = []

url = "https://www.propertypal.com/property-to-rent/newtownabbey/"
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")
g_data = soup.findAll("div", {"class": "propbox-details"})

for item in g_data:
    try:
        title = item.find_all("span", {"class": "propbox-addr"})[0].text
    except:
        pass
    try:
        town = item.find_all("span", {"class": "propbox-town"})[0].text
    except:
        pass
    try:
        price = item.find_all("span", {"class": "price-value"})[0].text
    except:
        pass
    try:
        period = item.find_all("span", {"class": "price-period"})[0].text
    except:
        pass
    course = [title, town, price, period]
    houses.append(course)
for i in range(1, 15):
    time.sleep(2)  # delay between requests so we don't get kicked by the server
    url2 = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}".format(i)
    page2 = requests.get(url2)
    print(url2)
    soup = BeautifulSoup(page2.text, "lxml")
    g_data = soup.findAll("div", {"class": "propbox-details"})
    for item in g_data:
        try:
            title = item.find_all("span", {"class": "propbox-addr"})[0].text
        except:
            pass
        try:
            town = item.find_all("span", {"class": "propbox-town"})[0].text
        except:
            pass
        try:
            price = item.find_all("span", {"class": "price-value"})[0].text
        except:
            pass
        try:
            period = item.find_all("span", {"class": "price-period"})[0].text
        except:
            pass
        course = [title, town, price, period]
        houses.append(course)
with open('newtownabbeyrentalproperties.csv', 'w', newline='') as file:  # newline='' avoids blank rows on Windows
    writer = csv.writer(file)
    writer.writerow(['Address', 'Town', 'Price', 'Period'])
    for row in houses:
        writer.writerow(row)
As you can see from the code I am using,
for i in range(1, 15):
    time.sleep(2)  # delay between requests so we don't get kicked by the server
    url2 = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}".format(i)
this appends the numbers 1 to 14 to the page URL.
This isn't ideal, because if the site ever has more pages, say page 15, 16 or 17, the scraper will miss the data on them, since it never looks past page 14.
Can anyone help me work out how to read the number of pages to scrape from the site's pagination, or suggest a better way to set up this for loop?
Many thanks.
Make a request to a page that doesn't exist, for example: https://www.propertypal.com/property-to-rent/newtownabbey/page-999999
Find a difference between pages that exist and pages that don't.
Parse the next pages until you find that difference.
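A minimal sketch of that idea, untested and assuming a non-existent page simply renders with no propbox-details listings: probe one out-of-range page, record what "empty" looks like, then loop until a real page matches it. The actual difference might instead be a redirect or a 404, so adjust the check to whatever you observe.

import time
import requests
from bs4 import BeautifulSoup

base = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}"

# Probe a page that certainly doesn't exist and note how many listings it shows.
probe = BeautifulSoup(requests.get(base.format(999999)).text, "lxml")
empty_count = len(probe.findAll("div", {"class": "propbox-details"}))  # presumably 0

page_num = 1
while True:
    time.sleep(2)  # be polite to the server
    soup = BeautifulSoup(requests.get(base.format(page_num)).text, "lxml")
    listings = soup.findAll("div", {"class": "propbox-details"})
    if len(listings) <= empty_count:
        break  # this page looks like the non-existent one, so we've run out of pages
    # ... scrape `listings` here, exactly as in the question ...
    page_num += 1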
See my edits below. This solution should keep walking through the pages until it tries to fetch one that doesn't exist. It also avoids wasted requests: your code always tries a fixed 14 pages, even when there are only one, two or three.
page_num = 0
http_status_okay = True
while http_status_okay:
    page_num = page_num + 1
    time.sleep(2)  # delay between requests so we don't get kicked by the server
    url2 = "https://www.propertypal.com/property-to-rent/newtownabbey/page-{0}".format(page_num)
    page2 = requests.get(url2)

    # continue only while we get a 200 response code
    if page2.status_code == 200:
        http_status_okay = True
        # ... scrape this page here, as in the question ...
    else:
        http_status_okay = False
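Note that this only works if the site actually returns a non-200 status for out-of-range page numbers. Many sites return 200 with an empty results page instead, in which case checking for the presence of propbox-details divs, as in the sketch above, is the more reliable signal.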
Something like this (untested, so I can't say whether it works; it's just to illustrate the principle):
button_next = soup.find("a", {"class": "btn paging-next"}, href=True)
while button_next:
    time.sleep(2)  # delay between requests so we don't get kicked by the server
    url2 = "https://www.propertypal.com{0}".format(button_next["href"])
    page2 = requests.get(url2)
    print(url2)
    soup = BeautifulSoup(page2.text, "lxml")
    g_data = soup.findAll("div", {"class": "propbox-details"})
    for item in g_data:
        try:
            title = item.find_all("span", {"class": "propbox-addr"})[0].text
        except:
            pass
        try:
            town = item.find_all("span", {"class": "propbox-town"})[0].text
        except:
            pass
        try:
            price = item.find_all("span", {"class": "price-value"})[0].text
        except:
            pass
        try:
            period = item.find_all("span", {"class": "price-period"})[0].text
        except:
            pass
        course = [title, town, price, period]
        houses.append(course)
    button_next = soup.find("a", {"class": "btn paging-next"}, href=True)
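Following the site's own "next" link is arguably the most robust of the three approaches: it needs no guess about the total page count and no probe request, and it stops naturally on the last page, where the next button is absent.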