Python Web Scraping - handling page 404 errors
I am doing web scraping with Python, Selenium and a headless Chrome driver, which involves running a loop:
# perform loop
CustId=2000
while (CustId<=3000):
    # Part 1: Customer REST call:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    dict_from_json = json.loads(soup.find("body").text)
    #logic for webscraping is here......
    CustId = CustId+1
# close driver at end of everything
driver.close()
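However, for some customer IDs the page may not exist. I have no control over this, and the code stops with a page not found 404 error. How do I ignore this and continue the loop?
I'm guessing I need a try....except?
Maybe one way is to try: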
try:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    dict_from_json = json.loads(soup.find("body").text)
    #logic for webscraping is here......
    CustId = CustId+1
except Exception:  # not a bare except, so Ctrl+C still stops the script
    print("404 error found, moving on")
    CustId = CustId+1
Sorry if this doesn't work, I haven't tested it.
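A narrower variant of the try/except idea, as a minimal sketch. As far as I know, driver.get() does not itself raise on a 404 response, so the exception most likely comes from json.loads() being handed the text of the error page. Under that assumption, catching only json.JSONDecodeError keeps unrelated failures visible. The helper name parse_json_body is made up for illustration.

```python
import json


def parse_json_body(body_text):
    """Return the decoded JSON, or None when body_text is not valid JSON."""
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        # Assumption: an error page is HTML or plain text, so decoding fails here.
        return None
```

Inside the loop this would be called as parse_json_body(soup.find("body").text), skipping the customer whenever the result is None.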
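You can check what text appears in the h1 tag of the page body when a 404 error occurs. Then put that text in an if clause, so that the block is entered only when the text is absent.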
CustId=2000
while (CustId<=3000):
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    if "Page not found" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        #logic for webscraping is here......
    CustId=CustId+1
Or
CustId=2000
while (CustId<=3000):
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    if "404" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        #logic for webscraping is here......
    CustId=CustId+1
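The advice above mentions the h1 tag, while both loops search the whole body. Searching the whole body can skip valid records: if the JSON happens to contain "404" (for example an id of 2404), the customer is treated as missing. Below is a rough sketch that looks only inside h1 elements. It uses the standard library HTMLParser instead of BeautifulSoup so that it runs on its own; in the loops above, soup.find("h1") would serve the same purpose. H1Collector, is_error_page and the default marker text are assumptions for illustration, and the real marker depends on what the site's error page shows.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <h1>
        self.headings = []   # text of each <h1> seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._depth += 1
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.headings[-1] += data


def is_error_page(page_source, marker="Page not found"):
    """Return True if any <h1> in page_source contains the marker text."""
    collector = H1Collector()
    collector.feed(page_source)
    return any(marker in heading for heading in collector.headings)
```

In the loop, the condition would become: if not is_error_page(driver.page_source).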
The ideal approach would be to use the range() function and to call driver.quit() at the end, as follows:
for CustId in range(2000, 3001):  # stop value 3001 so that ID 3000 is included
    try:
        urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
        driver.get(urlg)
        if "404" not in driver.page_source:
            soup = BeautifulSoup(driver.page_source,"lxml")
            dict_from_json = json.loads(soup.find("body").text)
            #logic for webscraping is here......
    except Exception:
        continue
driver.quit()
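A different technique from the answers above, offered only as a sketch: Selenium does not hand back the HTTP status code from driver.get(), but a plain HTTP client does. This assumes the REST endpoint returns JSON without needing JavaScript or a logged-in browser session, which may not hold for this site. fetch_customer and its opener parameter are hypothetical names; opener exists only so the function can be tried without network access.

```python
import json
import urllib.error
import urllib.request


def fetch_customer(cust_id, opener=urllib.request.urlopen):
    """Return the parsed JSON for one customer, or None if the server answers 404."""
    url = f"https://mywebsite.com/customerRest/show/?id={cust_id}"
    try:
        with opener(url) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # missing customer: the caller skips it
        raise            # any other status is re-raised unchanged


# Possible use in the loop:
# for CustId in range(2000, 3001):
#     dict_from_json = fetch_customer(CustId)
#     if dict_from_json is None:
#         continue
#     #logic for webscraping is here......
```

Checking err.code avoids guessing from the page text, at the cost of dropping the browser.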