Scraping author names from a website with try/except using Python

Question

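I am trying to use try/except to scrape the different pages of a URL that contains author data. I need a set of author names from 10 consecutive pages of this website.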

# Import Packages
import requests
import bs4
from bs4 import BeautifulSoup as bs
# Output list
authors = [] 
# Website Main Page URL
URL = 'http://quotes.toscrape.com/'
res = requests.get(URL)
soup = bs4.BeautifulSoup(res.text,"lxml")
# Get the contents from the first page
for item in soup.select(".author"):
    authors.append(item.text)
page = 1
pagesearch = True
# Get the contents from 2-10 pages
while pagesearch:
    # Check if page is available
    try:
            req = requests.get(URL + '/' + 'page/' + str(page) + '/')
            soup = bs(req.text, 'html.parser')
            page = page + 1
            for item in soup.select(".author"): # Append the author class from the webpage html
                authors.append(item.text)  
    except:
        print("Page not found")
        pagesearch == False
        break # Break if no page is remaining

print(set(authors)) # Print the output as a unique set of author names

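The first page has no page number in its URL, so I handled it separately. I am using a try/except block to iterate over all possible pages, expecting an exception to be raised and the loop to break once the last page has been scanned.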

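When I run the program it goes into an infinite loop, whereas it should print the "Page not found" message once the pages run out. When I interrupt the kernel, I see the correct result as a set along with my exception message, but nothing before that. I get the following result.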

Page not found
{'Allen Saunders', 'J.K. Rowling', 'Pablo Neruda', 'J.R.R. Tolkien', 'Harper Lee', 'J.M. Barrie',
 'Thomas A. Edison', 'J.D. Salinger', 'Jorge Luis Borges', 'Haruki Murakami', 'Dr. Seuss',
 'George Carlin', 'Alexandre Dumas fils', 'Terry Pratchett', 'C.S. Lewis', 'Ralph Waldo Emerson',
 'Jim Henson', 'Suzanne Collins', 'Jane Austen', 'E.E. Cummings', 'Jimi Hendrix', 'Khaled Hosseini',
 'George Eliot', 'Eleanor Roosevelt', 'André Gide', 'Stephenie Meyer', 'Ayn Rand',
 'Friedrich Nietzsche', 'Mother Teresa', 'James Baldwin', 'W.C. Fields', "Madeleine L'Engle",
 'William Nicholson', 'George R.R. Martin', 'Marilyn Monroe', 'Albert Einstein', 'George Bernard Shaw',
 'Ernest Hemingway', 'Steve Martin', 'Martin Luther King Jr.', 'Helen Keller', 'Charles M. Schulz',
 'Charles Bukowski', 'Alfred Tennyson', 'John Lennon', 'Garrison Keillor', 'Bob Marley',
 'Mark Twain', 'Elie Wiesel', 'Douglas Adams'}

What could be the reason for this? Thanks.

Answer 1

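I think that's because there literally is a page. An exception might be raised when there is no page to display in the browser. But when you make this request: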

http://quotes.toscrape.com/page/11/

the browser still shows a page, one that bs4 can parse and query for elements.
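To illustrate why the except branch never fires, here is a minimal sketch. The HTML string is a hypothetical stand-in for what a page past the last one might contain, not a copy of the real response. Parsing it raises nothing; select() simply returns an empty list. Note also that requests.get() does not raise for an HTTP error status on its own unless you call raise_for_status() on the response.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page beyond the last one: valid HTML, no quotes.
empty_page = "<html><body><div class='col-md-8'>No quotes found!</div></body></html>"

soup = BeautifulSoup(empty_page, "html.parser")

# No exception here: a selector with no matches yields an empty list.
authors = [tag.text for tag in soup.select(".author")]
print(authors)
```

So one possible stop condition is to break out of the loop when a page yields no authors, instead of waiting for an exception.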

How do you stop at page 11? You can keep track of whether the Next button is present.
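A rough sketch of that idea follows. It assumes the pager marks its Next link as `<li class="next"><a href="...">`; check the page source to confirm. The function name `scrape_authors` and the `fetch` parameter are made up for this example (`fetch` only exists so the loop can be tried without hitting the network).

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def scrape_authors(start_url, fetch=lambda url: requests.get(url, timeout=10).text):
    """Follow the Next link from page to page and collect unique author names."""
    authors = set()
    url = start_url
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        for tag in soup.select(".author"):
            authors.add(tag.get_text(strip=True))
        # Assumption: the Next button is <li class="next"><a href="/page/N/">.
        next_link = soup.select_one("li.next > a")
        # Stop when there is no Next button; otherwise resolve the relative href.
        url = urljoin(url, next_link["href"]) if next_link else None
    return authors


# Usage (performs real HTTP requests):
# print(scrape_authors("http://quotes.toscrape.com/"))
```

This way the loop ends on the last real page and never requests page 11 at all.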

Thanks for reading.

Answer 2

Try using the built-in range() function to loop over pages 1-10:

import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/page/{}/"
authors = []

for page in range(1, 11):
    response = requests.get(url.format(page))
    print("Requesting Page: {}".format(response.url))
    soup = BeautifulSoup(response.content, "html.parser")
    for tag in soup.select(".author"):
        authors.append(tag.text)

print(set(authors))
