Selenium webdriver loops through all pages, but only scraping data for first page

Question

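I am trying to scrape data from multiple pages (36) of a website to collect the document number and the revision number of every available document, and to save them in two separate lists. If I run the code block below on each page individually, it works perfectly. However, when I add a while loop to go through all 36 pages, it loops, but only the data from the first page is saved.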

#sam.gov website
url = 'https://sam.gov/search/?index=sca&page=1&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'

#webdriver
driver = webdriver.Chrome(options = options_, executable_path = r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe' )
driver.get(url)

#get rid of pop up window
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()

#list of revision numbers
revision_num = []

#empty list for all the WD links
WD_num = []
substring = '2015'

current_page = 0

while True:
    
    current_page += 1
    if current_page == 36:
        #find all elements on the page with the class 'sds-field__name'. For each one, get the text. if the text is 'Revision Number'
        #then, get the 'sibling' element, which is the actual revision number. append its text to the revision_num list. 
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
                    
            
        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
            
        
        print('Last Page Complete!')
        break
             
    else:
        #find all elements on the page with the class 'sds-field__name'. For each one, get the text. if the text is 'Revision Number'
        #then, get the 'sibling' element, which is the actual revision number. append its text to the revision_num list. 
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
                    
            

        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
            
        #click on next page
        click_icon = WebDriverWait(driver, 5, 0.25).until(EC.visibility_of_element_located([By.ID,'bottomPagination-nextPage']))
        click_icon.click()
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, 'main-container')))

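Things I've tried:

I've added WebDriverWait calls to slow the script down so the page can load and/or the elements can become clickable/located
I declared the empty lists outside of the loop so they aren't overwritten on each iteration
I've edited the while loop multiple times, either to count to 36 (while current_page < 37) or to move the counter to the top or bottom of the loop

Any ideas? TIA.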

Edit: added a screenshot of the 'field name' element.

Answer 1

I refactored your code to make things much simpler.

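    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # options_ is the options object from the question (defined elsewhere).
    # executable_path is the Selenium 3 style call used in the question.
    driver = webdriver.Chrome(options=options_, executable_path=r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe')

    revision_num = []
    WD_num = []

    # {} is replaced with the page number on each iteration
    base_url = 'https://sam.gov/search/?index=sca&page={}&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'

    for page in range(1, 37):
        # load each results page directly instead of clicking "next"
        driver.get(base_url.format(page))
        if page == 1:
            # get rid of the pop up window (handled on the first load only)
            WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()

        # links whose text contains '2015' (the WD numbers)
        wd_links = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class,'usa-link') and contains(.,'2015')]")))
        # the div that follows each 'Revision Number' label
        revisions = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='sds-field__name' and text()='Revision Number']/following-sibling::div")))

        for wd_link in wd_links:
            WD_num.append(wd_link.text)

        for revision in revisions:
            revision_num.append(revision.text)

    print(revision_num)
    print(WD_num)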
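The loop above gets each page by putting the page number into the URL. As a rough sketch of the same idea using only the standard library, the helper below rewrites the `page` query parameter and leaves the other parameters alone. The name `page_urls` and the shortened example URL are illustrative assumptions, not part of the original code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(first_page_url, last_page):
    """Return one URL per results page by rewriting the `page` query parameter."""
    parts = urlsplit(first_page_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(1, last_page + 1):
        # Replace only the `page` value; every other parameter keeps its order.
        query = urlencode([(k, str(page) if k == "page" else v) for k, v in params])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


# Shortened, hypothetical search URL used only for illustration.
urls = page_urls("https://sam.gov/search/?index=sca&page=1&pageSize=25", 36)
print(len(urls))
print(urls[-1])
```

Each URL from the list could then be passed to `driver.get(...)` in turn, which avoids depending on the "next page" button at all.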

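If you know you only need to iterate over 36 pages, you can pass the page number in the URL.
Use WebDriverWait to wait until the elements are visible.
Build your XPath so that it identifies the elements uniquely, without needing an if check.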
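To illustrate the second point: an explicit wait keeps re-checking a condition until it returns something truthy or a timeout expires. The sketch below is plain Python, not Selenium's implementation; `wait_until`, `results_loaded`, and the sample value are hypothetical names and data used only to show the polling pattern. One consequence worth noting is that a condition which is already true (for example, a container that is present on every page) returns at once, so it does not by itself delay scraping until new results have rendered.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.25):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for a page whose results only show up on the third check.
polls = {"count": 0}


def results_loaded():
    polls["count"] += 1
    # "2015-0001" is a made-up sample value, not real data from the site.
    return ["2015-0001"] if polls["count"] >= 3 else []


rows = wait_until(results_loaded, timeout=2.0, poll=0.01)
print(rows, polls["count"])
```

In Selenium itself, this role is played by `WebDriverWait(driver, timeout).until(condition)`, as used in the code above.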

[Screenshot: console output on my terminal]
