Scraping an HTML site using BeautifulSoup and finding the value of "total_pages" in it
I am writing Python code to scrape the following website and find the value of "total_pages" in it.
The site is https://www.usnews.com/best-colleges/fl
When I open the site in a browser and view the page source, the value of "total_pages" is 8. I want my Python code to get the same value.
I wrote the following code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")
But then I got stuck on how to find "total_pages" in the parsed data. I tried the find_all() method but had no success. I think I am not using the method correctly.
One note: the solution does not have to use BeautifulSoup. I only used it because I am somewhat familiar with it.
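Roughly, what I was hoping to end up with is something like the sketch below. I am only assuming that the number appears verbatim in the static page source, e.g. as JSON embedded in a <script> tag; if it is rendered by JavaScript this will not find anything.

import re

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
html = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Scan every <script> tag for a '"total_pages": <number>' pattern.
for script in soup.find_all("script"):
    match = re.search(r'"total_pages"\s*:\s*(\d+)', script.get_text())
    if match:
        print("total_pages =", match.group(1))
        break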
You don't need BeautifulSoup. Here, I make a request to their API to get the list of colleges.
from rich import print
is used to pretty-print the JSON, which should make it easier to read.
If you need more help or advice, please comment below.
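For example, to see what the payload looks like before writing the full scraper, you can fetch one page and pretty-print it. This is only an inspection sketch; the only keys the scraper below relies on are data["items"] and data["next_link"], and anything else you see is simply whatever the endpoint returns.

import requests
from rich import print

API = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

response = requests.get(API, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
payload = response.json()["data"]

# Pretty-print the top-level keys and one item to explore the structure.
print(payload.keys())
print(payload["items"][0])

Once the structure is clear, the full scraper below just follows next_link until there are no more pages: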
import requests
import pandas as pd
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"


def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None


def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    # as long as there's a `next_link`, scrape it.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        items += new_items
    # cleaning the data, for the pandas dataframe.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())  # to_markdown() requires the `tabulate` package


if __name__ == "__main__":
    main()
The output looks like this:
|    | name                              | state | city           |
|---:|:----------------------------------|:------|:---------------|
|  0 | University of Florida             | FL    | Gainesville    |
|  1 | Florida State University          | FL    | Tallahassee    |
|  2 | University of Miami               | FL    | Coral Gables   |
|  3 | University of South Florida       | FL    | Tampa          |
|  4 | University of Central Florida     | FL    | Orlando        |
|  5 | Florida International University  | FL    | Miami          |
|  6 | Florida A&M University            | FL    | Tallahassee    |
|  7 | Florida Institute of Technology   | FL    | Melbourne      |
|  8 | Nova Southeastern University      | FL    | Ft. Lauderdale |
| .. | ...                               | ...   | ...            |
| 74 | St. John Vianney College Seminary | FL    | Miami          |
| 75 | St. Petersburg College            | FL    | St. Petersburg |
| 76 | Tallahassee Community College     | FL    | Tallahassee    |
| 77 | Valencia College                  | FL    | Orlando        |
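Finally, about the total_pages value from the question: the same API payload may expose it directly, but I have not confirmed the field name, so treat this as a sketch and check the pretty-printed payload first.

import requests

API = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

payload = requests.get(API, timeout=5, headers={"User-Agent": "Mozilla/5.0"}).json()["data"]

# "total_pages" is a guessed field name; .get() returns None if it is absent.
print("total_pages:", payload.get("total_pages"))

Even if there is no such field, simply counting how many pages you visit while following next_link should give you the same information as the value you saw in the page source.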