Scraping an HTML site using BeautifulSoup and finding the value of "total_pages" in it
I am writing Python code to scrape the following website and find the value of "total_pages" in it.
The site is https://www.usnews.com/best-colleges/fl
When I open the site in a browser and view the page source, the value of "total_pages" is 8. I want my Python code to get the same value.
I wrote the following code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")
But then I got stuck on how to find "total_pages" in the parsed data. I tried the find_all() method but had no success. I think I am not using the method correctly.
One note: the solution does not have to use BeautifulSoup. I only used it because I am somewhat familiar with it.
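Roughly, what I was hoping to end up with is something like the sketch below. I am only assuming that the number appears verbatim in the static page source, e.g. as JSON embedded in a <script> tag; if it is rendered by JavaScript this will not find anything.

import re

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
html = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Scan every <script> tag for a '"total_pages": <number>' pattern.
for script in soup.find_all("script"):
    match = re.search(r'"total_pages"\s*:\s*(\d+)', script.get_text())
    if match:
        print("total_pages =", match.group(1))
        break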
You don't need BeautifulSoup. Here, I make a request to their API to get the list of colleges.
from rich import print
is used to pretty-print the JSON, which should make it easier to read.
If you need more help or advice, please comment below.
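For example, to see what the payload looks like before writing the full scraper, you can fetch one page and pretty-print it. This is only an inspection sketch; the only keys the scraper below relies on are data["items"] and data["next_link"], and anything else you see is simply whatever the endpoint returns.

import requests
from rich import print

API = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

response = requests.get(API, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
payload = response.json()["data"]

# Pretty-print the top-level keys and one item to explore the structure.
print(payload.keys())
print(payload["items"][0])

Once the structure is clear, the full scraper below just follows next_link until there are no more pages: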
import requests
import pandas as pd
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"


def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None


def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    # as long as there's a `next_link`, scrape it.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        items += new_items
    # cleaning the data, for the pandas dataframe.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())  # to_markdown() requires the `tabulate` package


if __name__ == "__main__":
    main()
The output looks like this:
|    | name                              | state | city           |
|---:|:----------------------------------|:------|:---------------|
|  0 | University of Florida             | FL    | Gainesville    |
|  1 | Florida State University          | FL    | Tallahassee    |
|  2 | University of Miami               | FL    | Coral Gables   |
|  3 | University of South Florida       | FL    | Tampa          |
|  4 | University of Central Florida     | FL    | Orlando        |
|  5 | Florida International University  | FL    | Miami          |
|  6 | Florida A&M University            | FL    | Tallahassee    |
|  7 | Florida Institute of Technology   | FL    | Melbourne      |
|  8 | Nova Southeastern University      | FL    | Ft. Lauderdale |
| .. | ...                               | ...   | ...            |
| 74 | St. John Vianney College Seminary | FL    | Miami          |
| 75 | St. Petersburg College            | FL    | St. Petersburg |
| 76 | Tallahassee Community College     | FL    | Tallahassee    |
| 77 | Valencia College                  | FL    | Orlando        |
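Finally, about the total_pages value from the question: the same API payload may expose it directly, but I have not confirmed the field name, so treat this as a sketch and check the pretty-printed payload first.

import requests

API = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

payload = requests.get(API, timeout=5, headers={"User-Agent": "Mozilla/5.0"}).json()["data"]

# "total_pages" is a guessed field name; .get() returns None if it is absent.
print("total_pages:", payload.get("total_pages"))

Even if there is no such field, simply counting how many pages you visit while following next_link should give you the same information as the value you saw in the page source.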