Weird behaviour when scraping Goodreads (Python)
I am trying to scrape Goodreads, more specifically Goodreads editions, by giving some ISBNs as input. However, every time the code runs, the process fails with:
Traceback (most recent call last):
  File "C:xxx.py", line 47, in <module>
    ed_details = get_editions_details(isbn)
  File "C:xxx.py", line 30, in get_editions_details
    ed_item = soup.find("div", class_="otherEditionsLink").find("a")
AttributeError: 'NoneType' object has no attribute 'find'
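Everything should be correct: the div class is right, and it seems to be present for every book. I checked in every browser and the page looks the same to me. I don't know whether this is because of a deprecated library or something else.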
import requests
from bs4 import BeautifulSoup as bs


def get_isbn():
    isbns = ['9780544176560', '9781796898279', '9788845278518', '9780374165277', '9781408839973', '9788838919916', '9780349121994', '9781933372006', '9781501167638', '9781427299062', '9788842050285', '9788807018985', '9780340491263', '9789463008594', '9780739349083', '9780156011594', '9780374106140', '9788845251436', '9781609455910']
    return isbns


def get_page(base_url, data):
    try:
        r = requests.get(base_url, params=data)
    except Exception as e:
        r = None
        print(f"Server responded: {e}")
    return r


def get_editions_details(isbn):
    # Create the search URL with the ISBN of the book
    data = {'q': isbn}
    book_url = get_page("https://www.goodreads.com/search", data)
    # Parse the markup with Beautiful Soup
    soup = bs(book_url.text, 'lxml')
    # Retrieve from the book's page the link for other editions
    # and the total number of editions
    ed_item = soup.find("div", class_="otherEditionsLink").find("a")
    ed_link = f"https://www.goodreads.com{ed_item['href']}"
    ed_num = ed_item.text.strip().split(' ')[-1].strip('()')
    # Return a tuple with all the informations
    return ((ed_link, int(ed_num), isbn))


if __name__ == "__main__":
    # Get the ISBNs from the user
    isbns = get_isbn()
    # Check all the ISBNs
    for isbn in isbns:
        ed_details = get_editions_details(isbn)
You should always check return values.
book_url = get_page("https://www.goodreads.com/search", data)
soup = bs(book_url.text, 'lxml')
ed_item = soup.find("div", class_="otherEditionsLink").find("a")
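In these statements, if any of the returned values is None, you will get an error as soon as you try to call a member function on it. So if soup.find("div", class_="otherEditionsLink") returns None, for example, the chained call amounts to None.find(....), which is clearly an error.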
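To make the failure concrete, here is a minimal sketch that reproduces it on a made-up HTML fragment (not real Goodreads markup). It uses the built-in html.parser instead of lxml only to keep the example dependency-light. The point is that find returns None when nothing matches, and the chained .find is what raises.

```python
from bs4 import BeautifulSoup

# Hypothetical markup without an "otherEditionsLink" div, for illustration only
html = "<html><body><div class='somethingElse'><a href='/x'>link</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None when no element matches
div = soup.find("div", class_="otherEditionsLink")
print(div)

# Calling .find on that None is what produces the traceback from the question
try:
    div.find("a")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If the first print shows None for a real response, the page that requests received does not contain that div, whatever the browser shows.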
For example, on the last line, you can fix this by splitting it into two steps:
if ed_item := soup.find("div", class_="otherEditionsLink"):
    if ed_item := ed_item.find("a"):
        ....other code here....
As long as soup is valid, this code will never try to call a function on a None value.
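Whether soup is valid depends on the request step: get_page returns None when requests.get raises, and book_url.text would then fail in the same way. Below is a minimal sketch of a guard for that step. The helper name page_text is hypothetical, and treating any status other than 200 as a failure is an assumption, not something the question confirms. The stand-in objects only mimic the status_code and text attributes of a requests response.

```python
from types import SimpleNamespace
from typing import Any, Optional


def page_text(response: Optional[Any]) -> Optional[str]:
    """Return the body of a usable response, or None otherwise."""
    if response is None:
        # get_page() caught an exception and returned None
        return None
    if response.status_code != 200:
        # Assumption: anything other than 200 is treated as a failed request
        return None
    return response.text


# Stand-in objects with the two attributes used above (no network access)
ok = SimpleNamespace(status_code=200, text="<html></html>")
blocked = SimpleNamespace(status_code=403, text="")

print(page_text(ok))
print(page_text(blocked))
print(page_text(None))
```

In get_editions_details, the result of page_text(book_url) could be checked for None before it is handed to Beautiful Soup.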
There are other ways to handle this situation. One is to return early on failure:
if (ed_item := soup.find("div", class_="otherEditionsLink")) is None:
    return None
if (ed_item := ed_item.find("a")) is None:
    return None
....other code here....
Another option is to use exceptions:
try:
    ed_item = soup.find("div", class_="otherEditionsLink").find("a")
    ....other code here....
except AttributeError:
    return None
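Putting the early-return variant together, the parsing part of get_editions_details might look roughly like the sketch below. The function name extract_editions, the BASE_URL constant, and the sample fragments are made up for illustration, and html.parser stands in for lxml. The count parsing is the same as in the question, with one extra check so that int() is never called on non-numeric text.

```python
from typing import Optional, Tuple

from bs4 import BeautifulSoup

BASE_URL = "https://www.goodreads.com"


def extract_editions(html: str, isbn: str) -> Optional[Tuple[str, int, str]]:
    """Return (editions link, edition count, isbn), or None if anything is missing."""
    soup = BeautifulSoup(html, "html.parser")

    # Check each lookup before using its result
    div = soup.find("div", class_="otherEditionsLink")
    if div is None:
        return None
    link = div.find("a")
    if link is None or not link.get("href"):
        return None

    # Same parsing as the question: last word of the link text, minus parentheses
    count_text = link.text.strip().split(" ")[-1].strip("()")
    if not count_text.isdigit():
        return None
    return (f"{BASE_URL}{link['href']}", int(count_text), isbn)


# Made-up fragments, not real Goodreads markup
with_link = '<div class="otherEditionsLink"><a href="/work/editions/1">All Editions (12)</a></div>'
without_link = "<div class='other'>nothing here</div>"

print(extract_editions(with_link, "9780544176560"))
print(extract_editions(without_link, "9780544176560"))
```

The loop over the ISBNs can then skip, or log, every ISBN for which the function returns None instead of stopping at the first failure.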