Incorrect img alt value being outputted (Python3, Beautiful Soup 4)
I've been working on a restaurant food hygiene scraper. I've been able to get the scraper to pull restaurant names, addresses and hygiene ratings based on a postcode. Since the food hygiene rating is displayed on the site as an image, I set the scraper up to read the "alt=" attribute, which contains the numerical value of the food hygiene rating.
The div containing the img alt tag I'm targeting for the food hygiene rating looks like this:
<div class="rating-image" style="clear: right;">
<a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
<img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
</a>
</div>
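For reference, reading that attribute out of this markup with Beautiful Soup looks roughly like this (a standalone sketch; the split()[0] step that isolates the number is just an illustration of mine, not something the scraper below does):

from bs4 import BeautifulSoup

snippet = '''
<div class="rating-image" style="clear: right;">
<a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
<img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
</a>
</div>
'''
soup = BeautifulSoup(snippet, "lxml")
img = soup.select_one('div.rating-image img[alt]')
print(img['alt'])             # 5 (Very Good)
print(img['alt'].split()[0])  # 5 - the numeric part only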
I've been able to output the food hygiene rating next to each restaurant.
My problem, however, is that I've noticed incorrect readings next to some restaurants, e.g. a food hygiene rating of 3 instead of 4 (the value that is stored in the img alt tag).
The link the scraper initially connects to is:
I think it may have something to do with where the ratings loop sits inside the "for item in g_data" for loop.
I've found that if I move
appendhygiene(scrape=[name,address,bleh])
outside the loop below
for rating in ratings:
    bleh = rating['alt']
then the data is scraped correctly, with the right hygiene ratings; the only problem is that not all records are scraped, and in that case it only outputs the first 9 restaurants.
Thanks to anyone who can look over my code below and help me get this resolved.
P.S. I used the postcode BT367NG to scrape restaurants (if you test the script, you can use it to see restaurants that don't show the correct hygiene value, e.g. Lins Garden is a 4 on the site, but the scraped data shows 3).
My full code is below:
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

hygiene = []

def deletelist():
    hygiene.clear()

def savefile():
    filename = input("Please input name of file to be saved")
    with open(filename + '.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerow(['Address', 'Town', 'Price', 'Period'])
        for row in hygiene:
            writer.writerow(row)
    print("File Saved Successfully")

def appendhygiene(scrape):
    hygiene.append(scrape)

def makesoup(url):
    page = requests.get(url)
    print(url + " scraped successfully")
    return BeautifulSoup(page.text, "lxml")

def hygienescrape(g_data, ratings):
    for item in g_data:
        try:
            name = item.find_all("a", {"class": "name"})[0].text
        except:
            pass
        try:
            address = item.find_all("span", {"class": "address"})[0].text
        except:
            pass
        try:
            for rating in ratings:
                bleh = rating['alt']
        except:
            pass
        appendhygiene(scrape=[name, address, bleh])

def hygieneratings():
    search = input("Please enter postcode")
    soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + search + "&distance=1&search.x=16&search.y=21&gbt_id=0")
    hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))
    button_next = soup.find("a", {"rel": "next"}, href=True)
    while button_next:
        time.sleep(2)  # delay between requests so we don't get kicked by the server
        soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php{0}".format(button_next["href"]))
        hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))
        button_next = soup.find("a", {"rel": "next"}, href=True)

def menu():
    strs = ('Enter 1 to search Food Hygiene ratings \n'
            'Enter 2 to Exit\n')
    choice = input(strs)
    return int(choice)

while True:  # use while True
    choice = menu()
    if choice == 1:
        hygieneratings()
        savefile()
        deletelist()
    elif choice == 2:
        break
    elif choice == 3:
        break
It looks like your problem is here:
try:
    for rating in ratings:
        bleh = rating['alt']
except:
    pass
appendhygiene(scrape=[name,address,bleh])
What ends up happening is that the last value gets appended on every page. That's why, if the last value is "exempt", all the values will be exempt. If that rating is a 3, all the values on that page will be 3. And so on.
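A stripped-down illustration of that behaviour (the alt strings here are made up, not real scraped values):

ratings = ['5 (Very Good)', '4 (Good)', '3 (Generally Satisfactory)']
for rating in ratings:
    bleh = rating   # bleh is overwritten on every pass
print(bleh)         # 3 (Generally Satisfactory) - only the last value survives

Since appendhygiene() runs after this loop finishes, every restaurant on the page gets that final value.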
What you want is to write it like this:
try:
    bleh = item.find_all('img', {'alt': True})[0]['alt']
    appendhygiene(scrape=[name,address,bleh])
except:
    pass
so that each rating is appended individually, rather than simply appending the last one. I just tested it and it seems to work :)
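For context, here is roughly what hygienescrape looks like with that change applied. This is a sketch under a few assumptions of mine: the rating img sits inside each search-result div (which is what makes the fix work), the now-redundant ratings parameter is dropped (so both call sites in hygieneratings would lose their ratings= argument), and the bare except is narrowed to IndexError; keep the original try/except layout if you prefer:

def hygienescrape(g_data):
    for item in g_data:
        try:
            name = item.find_all("a", {"class": "name"})[0].text
            address = item.find_all("span", {"class": "address"})[0].text
            # Look the rating up inside this item, so each restaurant is
            # paired with its own alt value rather than a page-wide one.
            bleh = item.find_all('img', {'alt': True})[0]['alt']
            appendhygiene(scrape=[name, address, bleh])
        except IndexError:
            # Skip results missing a name, address or rating image.
            pass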