How do I skip an element in a list?

Question

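I'm trying to figure out how to add an image ID to a list and skip that image on the next search. This is my code so far; I've tried a lot of things... The bot should always add the image it most recently copied to the 'used' blacklist and not copy it again next time.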

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

search = True
used = []
driver = webdriver.Chrome()

driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH,value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)

while True:
    while search:

        post = driver.find_element(By.CSS_SELECTOR,value='.post-container a img')
        if post.id in used:
            search = True
        else:
            search = False



    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
    time.sleep(20)

Problem: it adds the used image to the list, but it still finds and copies the same image again...

https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b

Edit: code:

while True:
    driver.switch_to.window(gag_tab)

    post = driver.find_elements(By.CSS_SELECTOR,value='.post-container a img')


    for post in post:
        post_url = post.get_attribute('src')
        post_title = post.get_attribute('alt')
        # paste the URL and title into another site
        time.sleep(20)

Error:

Traceback (most recent call last):
  File "main.py", line 86, in <module>
    post_url = post.get_attribute('src')
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=101.0.4951.67)
Stacktrace:
Backtrace:
    Ordinal0 [0x009CB8F3+2406643]
    Ordinal0 [0x0095AF31+1945393]
    Ordinal0 [0x0084C748+837448]
    Ordinal0 [0x0084F154+848212]
    Ordinal0 [0x0084F012+847890]
    Ordinal0 [0x0084F98A+850314]
    Ordinal0 [0x008A50C9+1200329]
    Ordinal0 [0x0089427C+1131132]
    Ordinal0 [0x008A4682+1197698]
    Ordinal0 [0x00894096+1130646]
    Ordinal0 [0x0086E636+976438]
    Ordinal0 [0x0086F546+980294]
    GetHandleVerifier [0x00C39612+2498066]
    GetHandleVerifier [0x00C2C920+2445600]
    GetHandleVerifier [0x00A64F2A+579370]
    GetHandleVerifier [0x00A63D36+574774]
    Ordinal0 [0x00961C0B+1973259]
    Ordinal0 [0x00966688+1992328]
    Ordinal0 [0x00966775+1992565]
    Ordinal0 [0x0096F8D1+2029777]
    BaseThreadInitThunk [0x75B9FA29+25]
    RtlGetAppContainerNamedObjectPath [0x77C77A7E+286]
    RtlGetAppContainerNamedObjectPath [0x77C77A4E+238]

Answer 1

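First: you forgot to set search = True again after printing the post, so the inner loop is always skipped and the first post is printed over and over. But even with that fixed you are not done, because driver.find_element() always returns the first element that matches your arguments. The loop would then run forever: the first post is already in the used list, so search is set to True endlessly.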

Try using driver.find_elements() instead. It returns a list of all matching posts, so you can loop over the list and print each post like this:

posts = driver.find_elements(by=By.CSS_SELECTOR, value='.post-container a img')

for post in posts:
    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
    time.sleep(2)
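
The difference between checking only the first match and walking over all matches can be tried without a browser. The following is a minimal sketch, assuming plain strings stand in for the element IDs; first_unused and page_ids are hypothetical names for illustration and are not part of Selenium.

```python
def first_unused(candidate_ids, used):
    """Return the first ID that is not in `used`, or None if all were used."""
    for candidate in candidate_ids:
        if candidate not in used:
            return candidate
    return None


# Stand-ins for the IDs of the image elements currently on the page.
page_ids = ["a1", "b2", "c3"]

# A set gives fast membership checks and cannot hold duplicates.
used = set()
order = []

# Keep picking the next unused ID until every ID has been handled.
while (next_id := first_unused(page_ids, used)) is not None:
    used.add(next_id)
    order.append(next_id)

print(order)  # ['a1', 'b2', 'c3']
```

Each ID is handled exactly once because the search moves past entries that are already in used instead of stopping at the first match.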

Edit:

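Since driver.find_elements() only returns the posts that are currently loaded on the page, you need to call it again as you scroll down. That is why I put it in a while loop and skip posts that have already been printed. As for the StaleElementReferenceException, I added a try-except block to skip elements that can no longer be referenced. This happens when you scroll down the site too quickly. You import the exceptions like this: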

from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import WebDriverException

Just make sure there are no naming conflicts.

Here is my current solution:

used = []

while True:
    posts = driver.find_elements(by=By.CSS_SELECTOR, value='.post-container a img')

    for post in posts:
        if post.id not in used:
            try:
                post_url = post.get_attribute('src')
                post_title = post.get_attribute('alt')
            # Use a tuple to catch both types; `A or B` would only catch A.
            except (StaleElementReferenceException, WebDriverException):
                continue

            used.append(post.id)
            print(post_title)
            print(post_url)
            print('__________')
            time.sleep(2)
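
A note on catching more than one exception type: Python expects a tuple in the except clause, and an expression such as A or B evaluates to A alone. The check below is a small self-contained sketch; FirstError, SecondError and caught_by are hypothetical names used only to show the behavior and do not come from Selenium.

```python
class FirstError(Exception):
    """Hypothetical exception type used only for this demonstration."""


class SecondError(Exception):
    """A second, unrelated hypothetical exception type."""


def caught_by(handler_types, exc):
    """Return True if an except clause using `handler_types` catches `exc`."""
    try:
        raise exc
    except handler_types:
        return True
    except Exception:
        return False


# Classes are truthy, so `or` returns its first operand unchanged.
or_form = FirstError or SecondError
tuple_form = (FirstError, SecondError)

print(or_form is FirstError)                 # True
print(caught_by(or_form, SecondError()))     # False
print(caught_by(tuple_form, SecondError()))  # True
```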

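You need to scroll down the site, manually or automatically, to load more posts to print. Selenium provides driver.execute_script(), which you can use to run scroll commands step by step.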
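
One way to scroll automatically is sketched below, under stated assumptions: scroll_and_wait, pixels and pause are hypothetical names and values chosen for illustration, and the step size and pause that work on the real page may differ. The only Selenium call used is driver.execute_script(), which passes extra Python arguments to the script as arguments[0], arguments[1], and so on.

```python
import time


def scroll_and_wait(driver, pixels=800, pause=2.0):
    """Scroll the page down by `pixels`, then wait `pause` seconds.

    `driver` is expected to be a Selenium WebDriver (or any object with a
    compatible execute_script method). The defaults are guesses.
    """
    # window.scrollBy is standard browser JavaScript.
    driver.execute_script("window.scrollBy(0, arguments[0]);", pixels)
    # Give the site time to load additional posts before searching again.
    time.sleep(pause)
```

Calling scroll_and_wait(driver) at the end of each pass of the while loop, after the for loop has finished, would let the next driver.find_elements() call pick up newly loaded posts.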

Answer 2

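The variable 'post' has no relative context (the selector value starts with a period). Without a description of the actual page structure, it is hard to determine the exact code you need.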

I found these two YouTube clips instructive:

How I use SELENIUM with PYTHON to automate the web. Pt1: https://www.youtube.com/watch?v=pUUhvJvs-R4
How to scrape dynamic websites with Selenium: https://www.youtube.com/watch?v=lTypMlVBFM4
