Python IndexError and UnboundLocalError in Selenium

Apologies if this is a basic question, as I am still fairly new to Python. I am trying to develop a web-scraping script using Selenium, and I have most of the necessary functionality working (navigating from one page to another, locating and opening all the URLs within a page).

However, because of the nature of the site I am scraping, some elements are omitted on certain pages while present on others. Where they are omitted, the terminal throws an IndexError, which I am currently bypassing with an exception handler.

However, when I try to print the scraped data, I get the following error:

UnboundLocalError: local variable 'manufacturer' referenced before assignment

I understand this is probably because I handle the IndexError with an exception and then reference the variable in the subsequent print statement.
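
To illustrate, here is a stripped-down example (no Selenium involved, purely hypothetical) that produces the same pair of errors:

def demo():
    try:
        # an empty list has no element 0, so this raises IndexError
        # before anything is ever assigned to value
        value = [][0]
    except IndexError:
        pass
    # value was never bound, so this line raises
    # UnboundLocalError: local variable 'value' referenced before assignment
    print(value)

demo()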

Is there a way to work around both of these issues and still scrape the information I am looking for, and if so, how would I export the scraped data to a csv file?

The code for my function is as follows:

def scrape():
    browser.implicitly_wait(7)
    try:
        collection = browser.find_elements(By.XPATH,'//*[@id="page-content-wrapper"]/div/ul/li[5]/a')[0].text
        description = browser.find_elements(By.XPATH,'//*[(@id = "child-1")]//p')[0].text
        dimension = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//p')[0].text
        finish = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//p')[0].text
        country = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]//p')[0].text
        manufacturer = browser.find_elements(By.XPATH,'//div[(((count(preceding-sibling::*) + 1) = 4) and parent::*)]//p')[0].text
    except IndexError:
        pass
    print(collection, description, dimension, finish, country, manufacturer)
    browser.back()

Thanks very much!

Because your variables are created inside the try block, if one of them fails, that variable and everything below it never gets created, and when you try to reference it the interpreter has no idea what the variable is supposed to be. The print statement will work if you do the following:

def scrape():
    browser.implicitly_wait(7)
    collection = ""
    description = ""
    dimension = ""
    finish = ""
    country = ""
    manufacturer = ""

    try:
        collection = browser.find_elements(By.XPATH,'//*[@id="page-content-wrapper"]/div/ul/li[5]/a')[0].text
        description = browser.find_elements(By.XPATH,'//*[(@id = "child-1")]//p')[0].text
        dimension = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//p')[0].text
        finish = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//p')[0].text
        country = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]//p')[0].text
        manufacturer = browser.find_elements(By.XPATH,'//div[(((count(preceding-sibling::*) + 1) = 4) and parent::*)]//p')[0].text
    except IndexError:
        pass
    print(collection, description, dimension, finish, country, manufacturer)
    browser.back()

You still have to deal with the fact that when one lookup fails, the lookups below it never run, so those variables keep their empty placeholder values; I recommend using a dictionary:

def scrape():
    browser.implicitly_wait(7)
    page_elements = dict()
    page_elements['collection'] = ""
    page_elements['description'] = ""
    page_elements['dimension'] = ""
    page_elements['finish'] = ""
    page_elements['country'] = ""
    page_elements['manufacturer'] = ""
    try:
        page_elements['collection'] = browser.find_elements(By.XPATH,'//*[@id="page-content-wrapper"]/div/ul/li[5]/a')[0].text
    except IndexError:
        pass
    try:
        page_elements['description'] = browser.find_elements(By.XPATH,'//*[(@id = "child-1")]//p')[0].text
    except IndexError:
        pass
    try:
        page_elements['dimension'] = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//p')[0].text
    except IndexError:
        pass
    try:
        page_elements['finish'] = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//p')[0].text
    except IndexError:
        pass
    try:
        page_elements['country'] = browser.find_elements(By.XPATH,'//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]//p')[0].text
    except IndexError:
        pass
    try:
        page_elements['manufacturer'] = browser.find_elements(By.XPATH,'//div[(((count(preceding-sibling::*) + 1) = 4) and parent::*)]//p')[0].text
    except IndexError:
        pass
    print(page_elements)
    browser.back()

That said, a separate try block for each statement is a bit cumbersome, so if you keep the paths in a second dictionary you can do this:

def scrape():
    browser.implicitly_wait(7)
    page_elements = dict()
    page_elements['collection'] = ""
    page_elements['description'] = ""
    page_elements['dimension'] = ""
    page_elements['finish'] = ""
    page_elements['country'] = ""
    page_elements['manufacturer'] = ""
    
    element_paths = dict()
    element_paths['collection'] = '//*[@id="page-content-wrapper"]/div/ul/li[5]/a'
    element_paths['description'] = '//*[(@id = "child-1")]//p'
    element_paths['dimension'] = '//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//p'
    element_paths['finish'] = '//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//p'
    element_paths['country'] = '//*[(@id = "detailed-description")]//div[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]//p'
    element_paths['manufacturer'] = '//div[(((count(preceding-sibling::*) + 1) = 4) and parent::*)]//p'

    for element in page_elements:
        try:
            page_elements[element] = browser.find_elements(By.XPATH, element_paths[element])[0].text
        except IndexError:
            pass
    print(page_elements)
    browser.back()
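
As for the csv part of the question: page_elements is an ordinary dictionary, so it can be written out with Python's built-in csv module. Below is a minimal sketch, under two assumptions that are not part of the code above: scrape() is changed to return page_elements instead of only printing it, and you have a list of product URLs to visit.

import csv

def scrape_to_csv(urls, filename="output.csv"):
    # column names match the keys used in page_elements above
    fieldnames = ["collection", "description", "dimension",
                  "finish", "country", "manufacturer"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for url in urls:
            browser.get(url)
            row = scrape()  # assumes scrape() now returns page_elements
            writer.writerow(row)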