Weird behaviour when scraping Goodreads (Python)

Question

I am trying to scrape Goodreads, more specifically Goodreads editions, by providing some ISBNs as input. However, every time I run the code it fails with:

Traceback (most recent call last):
  File "C:xxx.py", line 47, in <module>
    ed_details = get_editions_details(isbn)
  File "C:xxx.py", line 30, in get_editions_details
    ed_item = soup.find("div", class_="otherEditionsLink").find("a")
AttributeError: 'NoneType' object has no attribute 'find'

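Everything should be correct: the div class is right, and it seems to be present for every book. I checked in every browser and the page looks the same. I don't know whether this is because a library is deprecated or for some other reason.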

import requests
from bs4 import BeautifulSoup as bs


def get_isbn():
    isbns = ['9780544176560', '9781796898279', '9788845278518', '9780374165277', '9781408839973', '9788838919916', '9780349121994', '9781933372006', '9781501167638', '9781427299062', '9788842050285', '9788807018985', '9780340491263', '9789463008594', '9780739349083', '9780156011594', '9780374106140', '9788845251436', '9781609455910']
    return isbns


def get_page(base_url, data):
    try:
        r = requests.get(base_url, params=data)
    except Exception as e:
        r = None
        print(f"Server responded: {e}")
    return r


def get_editions_details(isbn):
    # Create the search URL with the ISBN of the book
    data = {'q': isbn}
    book_url = get_page("https://www.goodreads.com/search", data)
    # Parse the markup with Beautiful Soup
    soup = bs(book_url.text, 'lxml')

    # Retrieve from the book's page the link for other editions
    # and the total number of editions

    ed_item = soup.find("div", class_="otherEditionsLink").find("a")

    ed_link = f"https://www.goodreads.com{ed_item['href']}"
    ed_num = ed_item.text.strip().split(' ')[-1].strip('()')

    # Return a tuple with all the informations
    return ((ed_link, int(ed_num), isbn))


if __name__ == "__main__":
    # Get the ISBNs from the user
    isbns = get_isbn()

    # Check all the ISBNs
    for isbn in isbns:
        ed_details = get_editions_details(isbn)

Answer 1

You should always check return values.

book_url = get_page("https://www.goodreads.com/search", data)
soup = bs(book_url.text, 'lxml')
ed_item = soup.find("div", class_="otherEditionsLink").find("a")

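In these statements, if any returned value is None, you will get an error as soon as you try to call a member function on it. So if soup is None, for example, you end up doing something like None.find(....), which is obviously an error. In the traceback above, the failing expression is the chained .find("a"), which means the first find() call on the div is the one that returned None.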
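The failure can be reproduced offline. The sketch below is only an illustration: the HTML string is a made-up stand-in for a page that lacks the expected div, and it uses the built-in "html.parser" rather than lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a page that does NOT contain div.otherEditionsLink.
html = "<html><body><div class='somethingElse'><a href='/x'>link</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() does not raise when nothing matches; it returns None.
missing = soup.find("div", class_="otherEditionsLink")

try:
    # Same chained call as in the question: this is None.find("a").
    missing.find("a")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(missing)
print(error_message)
```

The message captured in error_message is the same kind of AttributeError as in the question's traceback.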

For example, you could split the last line into two parts to fix this (the := operator requires Python 3.8 or later):

if ed_item := soup.find("div", class_="otherEditionsLink"):
    if ed_item := ed_item.find("a"):
        ...  # other code here

As long as soup itself is valid, this code never tries to call a function on a None value.
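The caveat about soup being valid matters because get_page in the question returns None when the request raises, and the next line then reads book_url.text. A minimal sketch of a guard for that step follows. The helper name response_text is hypothetical, and treating any status other than 200 as a failure is an assumption, not something the question states. The stand-in objects built with SimpleNamespace only mimic the status_code and text attributes of a requests response.

```python
from types import SimpleNamespace


def response_text(response):
    """Return the response body, or None when there is nothing usable to parse."""
    # get_page() in the question returns None when requests.get() raises.
    if response is None:
        return None
    # Assumption: anything other than HTTP 200 is treated as a failure.
    if response.status_code != 200:
        return None
    return response.text


# Stand-in objects that only imitate the two attributes used above.
ok = SimpleNamespace(status_code=200, text="<html></html>")
not_found = SimpleNamespace(status_code=404, text="not found")

print(response_text(None))
print(response_text(not_found))
print(response_text(ok))
```

With a guard like this, the caller can skip parsing whenever the helper returns None instead of failing on book_url.text.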

There are other ways to handle this situation. One is to return on failure:

if (ed_item := soup.find("div", class_="otherEditionsLink")) is None:
    return None
if (ed_item := ed_item.find("a")) is None:
    return None
...  # other code here

Another option is to use exceptions:

try:
    ed_item = soup.find("div", class_="otherEditionsLink").find("a")
    ...  # other code here
except AttributeError:
    return None
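Putting the return-on-failure variant together, the parsing step could be sketched as below. This is an illustration under assumptions, not the asker's final code: parse_editions is a hypothetical helper that takes the HTML as a string so it can be tried without network access, the sample markup is invented to match the class name and link text format the question's code expects, and "html.parser" is used in place of lxml.

```python
from bs4 import BeautifulSoup


def parse_editions(html, isbn):
    """Return (editions_link, editions_count, isbn), or None if the markup is missing."""
    soup = BeautifulSoup(html, "html.parser")

    div = soup.find("div", class_="otherEditionsLink")
    if div is None:
        return None

    link = div.find("a")
    if link is None or not link.get("href"):
        return None

    # Same extraction as in the question: last word of the link text, minus parentheses.
    count_text = link.text.strip().split(" ")[-1].strip("()")
    if not count_text.isdigit():
        return None

    return (f"https://www.goodreads.com{link['href']}", int(count_text), isbn)


# Invented sample pages: one with the expected div, one without it.
page_ok = (
    '<div class="otherEditionsLink">'
    '<a href="/work/editions/123">All Editions (42)</a></div>'
)
page_missing = "<div class='results'>No results.</div>"

print(parse_editions(page_ok, "9780544176560"))
print(parse_editions(page_missing, "9780544176560"))
```

The caller can then test the result for None and skip that ISBN rather than crash partway through the loop.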

python

request

web-scraping

python-3.x

python-requests