Selenium web scraping: how to prioritize one tab over another

Project: save all URLs/titles from https://theuselessweb.com/

Test code (only 3 pages, printing instead of saving):

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from time import sleep

PATH = r"C:\Users\XXX\Documents\scraping\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://theuselessweb.com/")
driver.switch_to.window(driver.window_handles[-1])
button = driver.find_element_by_id("button")

for i in range(3):
    button.click()
    sleep(2)
    driver.switch_to.window(driver.window_handles[-1])
    print(driver.current_url)
    print(driver.title)
    driver.close()

Error:

DevTools listening on ws://127.0.0.1:60235/devtools/browser/a5ea4ab0-fba6-4a34-b0ee-8926876c554f
[11636:4168:0626/143411.535:ERROR:device_event_log_impl.cc(214)] [14:34:11.535] USB: usb_device_handle_win.cc:1058 Failed to read descriptor from node connection: Ein an das System angeschlossenes Gerät funktioniert nicht. (0x1F)
[11636:4168:0626/143411.552:ERROR:device_event_log_impl.cc(214)] [14:34:11.552] USB: usb_device_handle_win.cc:1058 Failed to read descriptor from node connection: Ein an das System angeschlossenes Gerät funktioniert nicht. (0x1F)
[11636:4168:0626/143411.555:ERROR:device_event_log_impl.cc(214)] [14:34:11.555] USB: usb_device_handle_win.cc:1058 Failed to read descriptor from node connection: Ein an das System angeschlossenes Gerät funktioniert nicht. (0x1F)
https://thatsthefinger.com/           #this is what I want
The finger, deal with it.             #this is what I want
Traceback (most recent call last):
  File "C:\Users\XXX\Documents\scraping\programs\linkscraping.py", line 16, in <module>
    button.click()
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\XXX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=91.0.4472.124)

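It prints the URL and title of the first website, then crashes. Also, every time I run driver.get(ANYURL), it opens the link and a Chrome settings page (chrome://settings/triggeredResetProfileSettings). Maybe that is what messes things up; in any case, it would be very helpful if I could get rid of this unwanted window as well.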

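Here is a solution to the problem. It will still open every link, but since Chrome runs headless, none of this is visible to the user.

In this case, x is the number of random websites you want to extract.

The code opens the site, clicks the button x times, then switches to each of the opened tabs in turn and prints its URL and title. Finally, it closes Chrome.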

from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.headless = True
driver = webdriver.Chrome(
    ChromeDriverManager().install(), 
    options=options
)

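x = 10  # number of random sites to collect

driver.get('https://theuselessweb.com/')
# Selenium 3 API, as in the question; Selenium 4 uses driver.find_element(By.ID, "button")
button = driver.find_element_by_id("button")

# Each click opens a random site in a new tab; the driver stays on the first tab
for i in range(x):
    button.click()

# Visit the new tabs in turn (assumes handle 0 is the original tab)
for i in range(x):
    driver.switch_to.window(driver.window_handles[i+1])
    print(driver.current_url)
    print(driver.title)

# Close every tab and end the session
driver.quit()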
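
As for why the loop in the question crashes: the traceback is consistent with the driver still targeting the tab that driver.close() has just closed, so the next button.click() has no window to act on. A minimal sketch of an alternative that keeps the click-read-close loop is to remember the original tab's handle and switch back to it after each close. The helper names wait_for_new_window and collect_pages, and the timeout and poll parameters, are hypothetical and not part of Selenium; the code only relies on driver attributes already used above plus driver.current_window_handle and driver.find_element.

```python
import time


def wait_for_new_window(driver, known_handles, timeout=10.0, poll=0.2):
    """Return a window handle that is not in known_handles, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        for handle in driver.window_handles:
            if handle not in known_handles:
                return handle
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


def collect_pages(driver, clicks, timeout=10.0):
    """Click the button `clicks` times and return a list of (url, title) pairs.

    Assumes the driver is already on the page that holds the button.
    """
    main_window = driver.current_window_handle  # the tab that holds the button
    # "id" is the string value behind Selenium's By.ID locator strategy.
    button = driver.find_element("id", "button")
    results = []
    for _ in range(clicks):
        # Snapshot the open tabs so that only a tab opened by this click counts.
        known = set(driver.window_handles)
        button.click()
        new_window = wait_for_new_window(driver, known, timeout)
        if new_window is None:
            continue  # no new tab appeared in time; skip this click
        driver.switch_to.window(new_window)
        results.append((driver.current_url, driver.title))
        driver.close()  # closes only the tab that currently has focus
        # Without this switch the driver keeps pointing at the closed tab,
        # and the next click raises NoSuchWindowException.
        driver.switch_to.window(main_window)
    return results
```

With a driver created as in the code above and already on https://theuselessweb.com/, collect_pages(driver, 3) would return up to three (url, title) pairs. Because only handles that appear after a click are considered, a tab that was open beforehand (such as the Chrome settings page from the question) is ignored rather than mistaken for a result. The new tab may still be loading when its URL and title are read, so a short wait before reading them may be needed in practice.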