End while-loop when last page is reached using beautifulsoup
I'm trying to scrape trade publications from this site: https://www.webwire.com/IndustryList.asp
I can work through each individual industry section (e.g., 'Airlines / Aviation' or 'Automotive') without any problem, but when my loop reaches the last page of results for an industry it keeps running and never moves on to the next industry in the for loop.
I don't think I'm hitting an exception either, so how can I end the inner loop once it has reached the last available page, so that execution continues with the next item in the for loop?
import requests
from bs4 import BeautifulSoup

industries = ["AIR", "AUT", "LEI"]

for industry in industries:
    print(industry)
    print("==================")
    num = 1
    while True:
        url = f"https://www.webwire.com/TradePublications.asp?ind={industry}&curpage={num}"
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        for e in soup.select('#syndication-list li'):
            print(e.get_text())
        num = num + 1
    else:
        break
You can paginate with a for loop and the range function, like this:
import requests
from bs4 import BeautifulSoup

industries = ["AIR", "AUT", "LEI"]

for industry in industries:
    for page in range(1, 14):
        print(page)
        url = f'https://www.webwire.com/TradePublications.asp?ind={industry}&curpage={page}'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        for e in soup.select('#syndication-list li'):
            print(e.get_text())
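The range(1, 14) above hard-codes the number of result pages, so it only works if you already know how many pages each industry has. Below is a minimal sketch of how you might detect the page count instead, assuming the pagination links on each page carry a curpage= query parameter (not verified here); last_page_number is a hypothetical helper name.

import re
import requests
from bs4 import BeautifulSoup

def last_page_number(industry):
    # Fetch page 1 and collect every page number that appears in a
    # pagination link.  Assumption: the links use a "curpage=" parameter.
    url = f"https://www.webwire.com/TradePublications.asp?ind={industry}&curpage=1"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    pages = [
        int(match.group(1))
        for a in soup.select("a[href*='curpage=']")
        if (match := re.search(r"curpage=(\d+)", a["href"]))
    ]
    return max(pages, default=1)

# Usage: replace the hard-coded 14 with the detected count, e.g.
# for page in range(1, last_page_number(industry) + 1): ...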
This example iterates over the list industries and fetches all pages up to the last one:
import requests
from bs4 import BeautifulSoup

industries = ["AIR", "AUT", "LEI"]
url = "https://www.webwire.com/TradePublications.asp?ind={}&curpage={}"

for ind in industries:
    u = url.format(ind, 1)
    while True:
        soup = BeautifulSoup(requests.get(u).content, "html.parser")

        for li in soup.select("#syndication-list li"):
            print("{:<10} {}".format(ind, li.text))

        next_page = soup.select_one('a:-soup-contains("Next »")')
        if next_page:
            u = (
                "https://www.webwire.com/TradePublications.asp"
                + next_page["href"]
            )
        else:
            break
Prints:
...
LEI Women's Wear Daily/Fairchild Financial
LEI Worcester Quarterly Magazine
LEI Word Association/Econoguide Travel Books
LEI Worldwide Spa Review
LEI Worth Magazine
LEI Y Not Girl Magazine
LEI Yankee Driver
LEI Ziff Davis Media
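Note that :-soup-contains() is a CSS extension provided by the soupsieve package bundled with recent BeautifulSoup 4 releases; on older installs that selector may not be available. Here is a small sketch of an equivalent lookup using only basic BeautifulSoup calls, assuming the pagination link's visible text contains "Next" (find_next_href is a hypothetical helper name):

from bs4 import BeautifulSoup

def find_next_href(soup):
    # Return the href of the "Next" pagination link, or None on the last page.
    for a in soup.find_all("a", href=True):
        if "Next" in a.get_text():
            return a["href"]
    return None

Inside the while loop above you would call find_next_href(soup), follow the returned href when it is not None, and break otherwise, which is the same stop condition the answer uses.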