What is the most efficient way in Selenium to open and process links within a loop?

Question

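I am scraping a web page using Selenium and BeautifulSoup. In general, this works well. Please find the code below.

The page lists a set of categories nested 4 levels deep. At each level there are 20 items/links.

My question is: what is the most efficient way to open and process these links within a loop?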

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=options)

wd.get("url")

source = wd.page_source
soup = BeautifulSoup(source, "html.parser")
items = soup.select('ul[data-card-id="tree-list0972"]')
for item in items:
  ul = item.find('ul')
  # Iterate over the <li> tags only; iterating ul directly also yields text nodes
  for li in ul.find_all('li'):
    print(li.a.get('href') + ',' + li.a.text)
    # A new browser is started for every link, reusing the options defined above
    cats = webdriver.Chrome('chromedriver',options=options)

    # Here I need to open the link from the URL list (3 levels deep)
    # Note: h and domain are not defined in this snippet
    cats.get(h + domain + li.a.get('href'))

    # WebDriverWait only blocks once .until(...) is called on it
    WebDriverWait(cats, timeout=3)
    cats.quit()
wd.quit()

Answer 1

I would probably try to implement your use case without BeautifulSoup, structured as follows:

1. Create the web driver

wd = webdriver.Chrome('chromedriver',options=options)

2. Open the "main" web page

wd.get("url")

3. Get all the link elements

elements = wd.find_elements_by_css_selector('ul[data-card-id="..."] a')

4. Get the URL of each element

pages = []
for element in elements:
   pages.append(element.get_attribute('href'))

5. Process each page

for page in pages:
   wd.get(page)
   # ...
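
The five steps above can be extended to several levels of links while still reusing a single driver. The following is only a sketch under assumptions, not a definitive implementation: `crawl_levels`, `selenium_visitor`, and the `visit` callback are hypothetical names that do not appear in the question, and the traversal is a plain breadth-first walk that loads each URL at most once. The Selenium part uses `find_elements(By.CSS_SELECTOR, ...)`, which newer Selenium releases provide in place of `find_elements_by_css_selector`.

```python
from collections import deque
from urllib.parse import urljoin


def crawl_levels(start_url, visit, max_depth=4):
    """Walk the link tree level by level, loading every URL at most once.

    visit(url) loads and processes one page and returns the href values
    found on it. Links on pages at max_depth are not followed.
    Returns a dict mapping each visited URL to its depth (start page = 0).
    """
    depth_of = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        hrefs = visit(url)  # the page is loaded and processed here
        if depth >= max_depth:
            continue  # deepest level reached: ignore its links
        for href in hrefs:
            if not href:
                continue  # skip anchors without an href
            link = urljoin(url, href)  # resolve relative hrefs against the page
            if link not in depth_of:
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of


def selenium_visitor(driver, css_selector):
    """Build a visit() callback that reuses one already created WebDriver."""
    # Imported here so that crawl_levels can be tried without Selenium installed
    from selenium.webdriver.common.by import By

    def visit(url):
        driver.get(url)
        anchors = driver.find_elements(By.CSS_SELECTOR, css_selector)
        # Read the hrefs immediately: the elements become stale
        # as soon as the driver navigates to another page
        return [a.get_attribute("href") for a in anchors]

    return visit
```

With the driver from step 1, this could be called as `crawl_levels("url", selenium_visitor(wd, 'ul[data-card-id="..."] a'))`, followed by a single `wd.quit()` at the end. Collecting the href strings before navigating away is the same idea as step 4, and keeping one browser open avoids the cost of starting a new Chrome process for every link.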

python

beautifulsoup

selenium-chromedriver

selenium-webdriver