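How do I skip an element from a list
I'm trying to understand how to add an image ID to a list and skip that image on the next search.
Here is my code so far; I've tried a lot... The bot should always add the image it most recently copied to the 'used' blacklist and not copy it again next time.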
search = True
used = []
driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH,value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
while True:
    while search:
        post = driver.find_element(By.CSS_SELECTOR,value='.post-container a img')
        if post.id in used:
            search = True
        else:
            search = False
    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
    time.sleep(20)
Problem: it adds the used image to the list, but it still finds and copies the same image again...
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
Edit:
Code:
while True:
    driver.switch_to.window(gag_tab)
    post = driver.find_elements(By.CSS_SELECTOR,value='.post-container a img')
    for post in post:
        post_url = post.get_attribute('src')
        post_title = post.get_attribute('alt')
        # paste the url and title into another site
        time.sleep(20)
Error:
Traceback (most recent call last):
File "main.py", line 86, in <module>
post_url = post.get_attribute('src')
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=101.0.4951.67)
Stacktrace:
Backtrace:
Ordinal0 [0x009CB8F3+2406643]
Ordinal0 [0x0095AF31+1945393]
Ordinal0 [0x0084C748+837448]
Ordinal0 [0x0084F154+848212]
Ordinal0 [0x0084F012+847890]
Ordinal0 [0x0084F98A+850314]
Ordinal0 [0x008A50C9+1200329]
Ordinal0 [0x0089427C+1131132]
Ordinal0 [0x008A4682+1197698]
Ordinal0 [0x00894096+1130646]
Ordinal0 [0x0086E636+976438]
Ordinal0 [0x0086F546+980294]
GetHandleVerifier [0x00C39612+2498066]
GetHandleVerifier [0x00C2C920+2445600]
GetHandleVerifier [0x00A64F2A+579370]
GetHandleVerifier [0x00A63D36+574774]
Ordinal0 [0x00961C0B+1973259]
Ordinal0 [0x00966688+1992328]
Ordinal0 [0x00966775+1992565]
Ordinal0 [0x0096F8D1+2029777]
BaseThreadInitThunk [0x75B9FA29+25]
RtlGetAppContainerNamedObjectPath [0x77C77A7E+286]
RtlGetAppContainerNamedObjectPath [0x77C77A4E+238]
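First: you forgot to set search = True again after printing the last post, so the inner loop is skipped from then on and the first post is printed every time. But even with that fixed you are not done, because driver.find_element() always returns the first element that matches your arguments. The first post is already in the used list, so search would be set to True endlessly and you would be stuck in an infinite loop.
Try driver.find_elements() instead. It returns a list of all matching posts, so you can loop over the list and print each post like this: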
posts = driver.find_elements(by=By.CSS_SELECTOR, value='.post-container a img')
for post in posts:
    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
    time.sleep(2)
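Edit:
Since driver.find_elements() only returns the posts that are currently loaded on the page, you need to call it again as you scroll down. That is why I put it in a while loop and skip the posts that have already been printed. As for the StaleElementReferenceException, I added a try-except block that skips elements that can no longer be referenced. This happens when you scroll down the site too quickly. You import the exceptions like this:
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import WebDriverException
Just make sure there are no naming conflicts.
Here is my current solution: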
used = []
while True:
    posts = driver.find_elements(by=By.CSS_SELECTOR, value='.post-container a img')
    for post in posts:
        if post.id not in used:
            try:
                post_url = post.get_attribute('src')
                post_title = post.get_attribute('alt')
            # A tuple is required here: "except A or B" would only catch A.
            except (StaleElementReferenceException, WebDriverException):
                continue
            used.append(post.id)
            print(post_title)
            print(post_url)
            print('__________')
    time.sleep(2)
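A note on the except clause in the loop above: Python evaluates an expression such as A or B before the except sees it, and because classes are truthy the result is just A. To catch several exception types in one clause they have to be given as a tuple. The sketch below shows the difference without needing a browser. StaleError and DriverError are made-up stand-ins for the two Selenium exceptions, and the function names are only for illustration.

```python
class StaleError(Exception):
    """Hypothetical stand-in for StaleElementReferenceException."""


class DriverError(Exception):
    """Hypothetical stand-in for WebDriverException."""


def handle_with_or(exc):
    """Raise exc and try to catch it with the 'A or B' form."""
    try:
        raise exc
    except StaleError or DriverError:  # evaluates to plain 'except StaleError'
        return "caught"


def handle_with_tuple(exc):
    """Raise exc and catch it with a tuple of exception types."""
    try:
        raise exc
    except (StaleError, DriverError):  # matches either type
        return "caught"


if __name__ == "__main__":
    print(handle_with_tuple(DriverError()))  # handled by the tuple form
    try:
        handle_with_or(DriverError())
    except DriverError:
        print("the 'or' form let DriverError escape")
```

In Selenium itself, StaleElementReferenceException is a subclass of WebDriverException, so catching WebDriverException alone would also cover the stale case.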
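You need to scroll down the site, either manually or automatically, to load more posts to print. For the automatic route, the Selenium driver has an execute_script() method that lets you run scroll commands step by step.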
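The scroll-and-collect idea could be sketched roughly as follows. This is only an illustration under several assumptions: the function names, the step size and the number of rounds are made up, the image src is assumed to be a stable enough key for skipping duplicates (the code above uses post.id instead), and the string "css selector" is used because it is the value of By.CSS_SELECTOR, which keeps the snippet free of Selenium imports. Pass the exception types you want skipped via the ignored parameter, for example (StaleElementReferenceException,).

```python
import time

CSS_SELECTOR = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def collect_new_posts(driver, seen, selector=".post-container a img", ignored=()):
    """Return (src, alt) pairs for images whose src is not yet in `seen`."""
    fresh = []
    for img in driver.find_elements(CSS_SELECTOR, selector):
        try:
            src = img.get_attribute("src")
            alt = img.get_attribute("alt")
        except ignored:
            # The element could not be read (e.g. it went stale); skip it.
            continue
        if src and src not in seen:
            seen.add(src)  # remember it so the next round skips it
            fresh.append((src, alt))
    return fresh


def scroll_and_collect(driver, rounds=5, step_px=1500, pause=2.0, ignored=()):
    """Scroll down `rounds` times, collecting unseen posts before each step."""
    seen = set()
    collected = []
    for _ in range(rounds):
        collected.extend(collect_new_posts(driver, seen, ignored=ignored))
        # window.scrollBy is a standard browser API; arguments[0] is step_px.
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        time.sleep(pause)  # give the page time to load more posts
    return collected
```

With a real browser this would be called as scroll_and_collect(driver) after the cookie banner has been dismissed. A set is used for the seen items so that the membership check stays cheap as the list of posts grows.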
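The variable "post" has no relative context (the value starts with a period). Since there is no description of the actual page structure, it is hard to determine the correct code you need.
I found these two YouTube clips instructive:
- How I use SELENIUM to automate the Web with PYTHON. Pt1: https://www.youtube.com/watch?v=pUUhvJvs-R4
- How to scrape dynamic websites with Selenium: https://www.youtube.com/watch?v=lTypMlVBFM4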