Scrape table data with pagination

I am trying to scrape a table with pagination using Selenium. The site I am trying to scrape does not expose the page number in the URL.

table = '//*[@id="result-tables"]/div[2]/div[2]/div/table/tbody'

home = driver.find_elements(By.XPATH, '//tbody/tr/td[5]')
away = driver.find_elements(By.XPATH, '//tbody/tr/td[7]')

teams = []

page = 0
while page < 10:
    page += 1
    time.sleep(5)
    for i in range(len(home)):
        temp_data = home[i].text + '\n' + away[i].text
        teams.append(temp_data)

    driver.find_element(By.XPATH, '//*[@id="result-tables"]/div[3]/ul/li[12]/a/span').click()

teams = [] only stores the data from the first page. When the script moves on to another page, I get this error:

Traceback (most recent call last):
  File "C:\Users\XXX\OneDrive\Documents\A\b\s_pc.py", line 49, in <module>
    temp_data = home[i].text + '\n' + away[i].text
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 76, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 693, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 418, in execute
    self.error_handler.check_response(response)
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=96.0.4664.45)
Stacktrace:

I have now defined the home and away elements inside the while loop, and also moved time.sleep() to the beginning of the loop. The code no longer throws any errors.

Can you check whether this works as expected?

table = '//*[@id="result-tables"]/div[2]/div[2]/div/table/tbody'

teams = []

page = 0
while page < 10:
    time.sleep(5)
    # Re-locate the cells on every page so they point at the current DOM
    home = driver.find_elements(By.XPATH, '//tbody/tr/td[5]')
    away = driver.find_elements(By.XPATH, '//tbody/tr/td[7]')
    page += 1

    for i in range(len(home)):
        temp_data = home[i].text + '\n' + away[i].text
        teams.append(temp_data)

    driver.find_element(By.XPATH, '//*[@id="result-tables"]/div[3]/ul/li[12]/a/span').click()
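
For comparison, here is a minimal sketch of the same loop using Selenium's explicit waits instead of fixed time.sleep() calls, with the idea of re-locating the rows on every page and reading their text before the next pagination click. The XPaths and the 10-page limit are copied from the snippet above; the driver setup and the URL are placeholders, and presence_of_all_elements_located / element_to_be_clickable are the standard expected_conditions helpers.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/results")  # placeholder URL

wait = WebDriverWait(driver, 10)
next_button = '//*[@id="result-tables"]/div[3]/ul/li[12]/a/span'

teams = []

for page in range(10):
    # Wait for the table rows, then re-locate them so the references
    # point at the current DOM rather than the previous page.
    wait.until(EC.presence_of_all_elements_located((By.XPATH, '//tbody/tr/td[5]')))
    home = driver.find_elements(By.XPATH, '//tbody/tr/td[5]')
    away = driver.find_elements(By.XPATH, '//tbody/tr/td[7]')

    # Read the text immediately, before the next click invalidates the elements.
    for h, a in zip(home, away):
        teams.append(h.text + '\n' + a.text)

    # Click "next page" once the control is visible and enabled.
    wait.until(EC.element_to_be_clickable((By.XPATH, next_button))).click()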