Python Web Scraping - handling page 404 errors

I'm doing web scraping with Python/Selenium using the headless Chrome driver, which involves running a loop:

# perform loop
CustId = 2000
while CustId <= 3000:

    # Part 1: Customer REST call:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)

    soup = BeautifulSoup(driver.page_source, "lxml")

    dict_from_json = json.loads(soup.find("body").text)

    # logic for webscraping is here......

    CustId = CustId + 1

# close driver at end of everything
driver.close()

However, for certain customer IDs the page may not exist. I have no control over this, and my code stops with a page-not-found 404 error. How can I ignore it and continue the loop?

I'm guessing I need try...except?

Maybe one approach would be:

try:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)

    soup = BeautifulSoup(driver.page_source, "lxml")

    dict_from_json = json.loads(soup.find("body").text)

    # logic for webscraping is here......
except Exception:
    print("404 error found, moving on")

CustId = CustId + 1

Sorry if this doesn't work; I haven't tested it.
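A narrower variation of the same idea, assuming that a 404 error page never contains valid JSON: catch only `json.JSONDecodeError`, so that genuine bugs in the scraping logic still raise instead of being silently swallowed. The helper name `parse_customer_json` is just illustrative:

```python
import json

def parse_customer_json(body_text):
    """Return the decoded JSON dict, or None if the body is not valid JSON
    (e.g. the text of a 404 error page)."""
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        return None

# Inside the loop, skip the record when parsing fails:
# dict_from_json = parse_customer_json(soup.find("body").text)
# if dict_from_json is None:
#     continue
```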

You can check whether the page body (or its h1 tag) text contains the 404 error message when the page is missing, and put that check in an if clause so the scraping block only runs when the message is absent.

CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    if "Page not found" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......

    CustId = CustId + 1

Or:

CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    if "404" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......

    CustId = CustId + 1
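The two checks above can also be folded into one small helper so the loop reads the same either way; the marker strings are assumptions about what mywebsite.com's error page actually contains, so adjust them after inspecting a real missing-ID response:

```python
# Assumed markers of the site's error page - verify against a real 404 response.
NOT_FOUND_MARKERS = ("404", "Page not found")

def looks_like_404(body_text):
    """True if the body text contains any known error-page marker."""
    return any(marker in body_text for marker in NOT_FOUND_MARKERS)
```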

Ideally, use the range() function and call driver.quit() at the end, like this:

for CustId in range(2000, 3001):
    try:
        urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
        driver.get(urlg)
        if "404" not in driver.page_source:
            soup = BeautifulSoup(driver.page_source, "lxml")
            dict_from_json = json.loads(soup.find("body").text)
            # logic for webscraping is here......
    except Exception:
        continue
driver.quit()