Scraping an HTML site using BeautifulSoup and finding the value of "total_pages" in it

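I am writing Python code to scrape the following website and find the value of "total_pages" in it.

The site is https://www.usnews.com/best-colleges/fl

When I open the site in a browser and view the page source, the value of "total_pages" is 8. I want my Python code to get the same value.

I wrote the following code: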

import requests
from bs4 import BeautifulSoup

headers ={'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

main_site=requests.get("https://www.usnews.com/best-colleges/fl",headers=headers)
main_site_content=main_site.content
main_site_content_soup=BeautifulSoup(main_site_content,"html.parser")

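But then I got stuck on how to find "total_pages" in the parsed data. I tried the find_all() method but had no luck. I think I am not using the method correctly.

One note: the solution does not have to use BeautifulSoup. I only used BeautifulSoup because I am somewhat familiar with it.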

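You don't need BeautifulSoup. Here, I make requests to their API to get the list of universities.

The rich print function (from rich import print) is used to pretty-print the JSON, which should make it easier to read.

If you need more help or advice, leave a comment below.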

import pandas as pd  # df.to_markdown() below also needs the optional "tabulate" package
import requests
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"


def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None


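def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    if items is None:
        # the first request failed, so there is nothing to process.
        return

    # if there's a `next_link`, scrape it.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        if new_items is None:
            # a later request failed; keep what was collected so far.
            break
        items += new_items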

    # cleaning the data, for the pandas dataframe.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())


if __name__ == "__main__":
    main()

The output looks like this:

    name                               state  city
0   University of Florida              FL     Gainesville
1   Florida State University           FL     Tallahassee
2   University of Miami                FL     Coral Gables
3   University of South Florida        FL     Tampa
4   University of Central Florida      FL     Orlando
5   Florida International University   FL     Miami
6   Florida A&M University             FL     Tallahassee
7   Florida Institute of Technology    FL     Melbourne
8   Nova Southeastern University       FL     Ft. Lauderdale
... ...                                ...    ...
74  St. John Vianney College Seminary  FL     Miami
75  St. Petersburg College             FL     St. Petersburg
76  Tallahassee Community College      FL     Tallahassee
77  Valencia College                   FL     Orlando
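
If you still want the "total_pages" number itself, one option is to search the raw page text rather than the parsed tree, since values like this usually sit inside a script block where find_all() cannot reach them. The snippet below is only a rough sketch: it assumes the value appears in the source as a key followed by a number (for example "total_pages": 8), which I have not verified against the live page. The find_total_pages helper and the sample_html string are made up for illustration; in your code you would pass main_site.text instead.

```python
import re


def find_total_pages(html_text):
    """Return the first integer that follows a total_pages key, or None."""
    # Assumed layout: total_pages (optionally quoted), then ':' or '=',
    # then the number (optionally quoted). Adjust if the page differs.
    pattern = r'["\']?total_pages["\']?\s*[:=]\s*["\']?(\d+)'
    match = re.search(pattern, html_text)
    return int(match.group(1)) if match else None


# Hypothetical stand-in for main_site.text from the question's code.
sample_html = '<script>window.__DATA__ = {"meta": {"total_pages": 8, "page": 1}};</script>'

print(find_total_pages(sample_html))
```

If the helper returns None on the real page, the value is probably written in a different form (for example with escaped quotes) or loaded later by JavaScript, and the API approach above is the safer route.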