Incorrect img alt value being outputted (Python3, Beautiful Soup 4)
I've been working on a restaurant food hygiene scraper. I've been able to get the scraper to pull restaurant names, addresses and hygiene ratings based on a postcode. Since the food hygiene rating is displayed on the site as an image, I set the scraper up to read the "alt=" attribute, which contains the numerical value of the food hygiene rating.
The div containing the img alt tag I'm targeting for the food hygiene rating looks like this:
<div class="rating-image" style="clear: right;">
<a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
<img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
</a>
</div>
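For reference, reading that attribute out of this markup with Beautiful Soup looks roughly like this (a standalone sketch; the split()[0] step that isolates the number is just an illustration of mine, not something the scraper below does):

from bs4 import BeautifulSoup

snippet = '''
<div class="rating-image" style="clear: right;">
<a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
<img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
</a>
</div>
'''
soup = BeautifulSoup(snippet, "lxml")
img = soup.select_one('div.rating-image img[alt]')
print(img['alt'])             # 5 (Very Good)
print(img['alt'].split()[0])  # 5 - the numeric part only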
I've been able to output the food hygiene rating next to each restaurant.
My problem, however, is that I've noticed incorrect readings next to some restaurants, e.g. a food hygiene rating of 3 instead of 4 (the value that is stored in the img alt tag).
The link the scraper initially connects to is:
I think it may have something to do with where the ratings loop sits inside the "for item in g_data" for loop.
I've found that if I move
appendhygiene(scrape=[name,address,bleh])
outside the loop below
for rating in ratings:
    bleh = rating['alt']
then the data is scraped correctly, with the right hygiene ratings; the only problem is that not all records are scraped, and in that case it only outputs the first 9 restaurants.
Thanks to anyone who can look over my code below and help me get this resolved.
P.S. I used the postcode BT367NG to scrape restaurants (if you test the script, you can use it to see restaurants that don't show the correct hygiene value, e.g. Lins Garden is a 4 on the site, but the scraped data shows 3).
My full code is below:
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

hygiene = []

def deletelist():
    hygiene.clear()

def savefile():
    filename = input("Please input name of file to be saved")
    with open(filename + '.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerow(['Address', 'Town', 'Price', 'Period'])
        for row in hygiene:
            writer.writerow(row)
    print("File Saved Successfully")

def appendhygiene(scrape):
    hygiene.append(scrape)

def makesoup(url):
    page = requests.get(url)
    print(url + " scraped successfully")
    return BeautifulSoup(page.text, "lxml")

def hygienescrape(g_data, ratings):
    for item in g_data:
        try:
            name = item.find_all("a", {"class": "name"})[0].text
        except:
            pass
        try:
            address = item.find_all("span", {"class": "address"})[0].text
        except:
            pass
        try:
            for rating in ratings:
                bleh = rating['alt']
        except:
            pass
        appendhygiene(scrape=[name, address, bleh])

def hygieneratings():
    search = input("Please enter postcode")
    soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + search + "&distance=1&search.x=16&search.y=21&gbt_id=0")
    hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))
    button_next = soup.find("a", {"rel": "next"}, href=True)
    while button_next:
        time.sleep(2)  # delay between requests so we don't get kicked by the server
        soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php{0}".format(button_next["href"]))
        hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))
        button_next = soup.find("a", {"rel": "next"}, href=True)

def menu():
    strs = ('Enter 1 to search Food Hygiene ratings \n'
            'Enter 2 to Exit\n')
    choice = input(strs)
    return int(choice)

while True:  # use while True
    choice = menu()
    if choice == 1:
        hygieneratings()
        savefile()
        deletelist()
    elif choice == 2:
        break
    elif choice == 3:
        break
It looks like your problem is here:
try:
    for rating in ratings:
        bleh = rating['alt']
except:
    pass
appendhygiene(scrape=[name,address,bleh])
What ends up happening is that the last value gets appended on every page. That's why, if the last value is "exempt", all the values will be exempt. If that rating is a 3, all the values on that page will be 3. And so on.
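A stripped-down illustration of that behaviour (the alt strings here are made up, not real scraped values):

ratings = ['5 (Very Good)', '4 (Good)', '3 (Generally Satisfactory)']
for rating in ratings:
    bleh = rating   # bleh is overwritten on every pass
print(bleh)         # 3 (Generally Satisfactory) - only the last value survives

Since appendhygiene() runs after this loop finishes, every restaurant on the page gets that final value.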
What you want is to write it like this:
try:
    bleh = item.find_all('img', {'alt': True})[0]['alt']
    appendhygiene(scrape=[name,address,bleh])
except:
    pass
so that each rating is appended individually, rather than simply appending the last one. I just tested it and it seems to work :)
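For context, here is roughly what hygienescrape looks like with that change applied. This is a sketch under a few assumptions of mine: the rating img sits inside each search-result div (which is what makes the fix work), the now-redundant ratings parameter is dropped (so both call sites in hygieneratings would lose their ratings= argument), and the bare except is narrowed to IndexError; keep the original try/except layout if you prefer:

def hygienescrape(g_data):
    for item in g_data:
        try:
            name = item.find_all("a", {"class": "name"})[0].text
            address = item.find_all("span", {"class": "address"})[0].text
            # Look the rating up inside this item, so each restaurant is
            # paired with its own alt value rather than a page-wide one.
            bleh = item.find_all('img', {'alt': True})[0]['alt']
            appendhygiene(scrape=[name, address, bleh])
        except IndexError:
            # Skip results missing a name, address or rating image.
            pass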