Python | Selenium issue with scrolling down and finding by class name
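For a research project, I want to scrape some links from a web page that sit outside the viewport (you have to scroll down the page to see them).
- Example page: https://www.twitch.tv/lirik
- Example link: https://www.amazon.com/dp/B09FVR22R2
- The links are located in a div with class='Layout-sc-nxg1ff-0 itdjvg default-panel' (there are 16 links on the page in total).
I have written a script, but I get an empty list: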
from selenium import webdriver
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
time.sleep(3)
browser.execute_script("window.scrollBy(0,document.body.scrollHeight)")
time.sleep(3)
panel_blocks = browser.find_elements(by='class name', value='Layout-sc-nxg1ff-0 itdjvg default-panel')
browser.close()
print(panel_blocks)
print(type(panel_blocks))
After the page loads I get an empty list. Here is the output of the script above:
/usr/local/bin/python /Users/greg.fetisov/PycharmProjects/baltazar_platform/Twitch_parser.py
[]
<class 'list'>
Process finished with exit code 0
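P.S. When the webdriver opens the page, I see no scroll-down action. It just opens the page and then closes it once the time.sleep delays have elapsed.
How should I change the script to get the links properly?
Any help or advice would be appreciated!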
To print the values of the href attribute, you need to use WebDriverWait with visibility_of_all_elements_located(). For example:
Using CSS_SELECTOR:
driver.get("https://www.twitch.tv/lirik")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a")))])
Console output:
['https://www.amazon.com/dp/B09FVR22R2', 'http://bs.serving-sys.com/Serving/adServer.bs?cn=trd&pli=1077437714&gdpr=$%7BGDPR%7D&gdpr_consent=$%7BGDPR_CONSENT_68%7D&adid=1085757156&ord=[timestamp]', 'https://store.epicgames.com/lirik/rumbleverse', 'https://bitly/3GP0cM0', 'https://lirik.com/', 'https://streamlabs.com/lirik', 'https://twitch.amazon.com/tp', 'https://www.twitch.tv/subs/lirik', 'https://www.youtube.com/lirik?sub_confirmation=1', 'http://www.twitter.com/lirik', 'http://www.instagram.com/lirik', 'http://gfuel.ly/lirik', 'http://www.cyberpowerpc.com/', 'https://www.cyberpowerpc.com/page/Intel/LIRIK/', 'https://discord.gg/lirik', 'http://www.amazon.com/?_encoding=UTF8&camp=1789&creative=390957&linkCode=ur2&tag=l0e6d-20&linkId=YNM2SXSSG3KWGYZ7']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
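A likely reason the original script returns an empty list is that the class name locator expects a single class name, while 'Layout-sc-nxg1ff-0 itdjvg default-panel' is three space-separated classes. The CSS selector above joins them with dots instead. The sketch below shows that conversion; classes_to_css is a hypothetical helper written for illustration, not part of Selenium, and it assumes the class names contain no characters that need CSS escaping.

```python
def classes_to_css(class_attr, tag="", child=""):
    """Turn a space-separated class attribute into a compound CSS selector.

    Hypothetical helper for illustration only; it does not escape
    special characters in class names.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class_attr must contain at least one class name")
    # Each class becomes ".name", chained with no spaces so all must match
    selector = tag + "".join("." + name for name in names)
    if child:
        # Restrict the match to direct children, e.g. the <a> inside the panel
        selector += " > " + child
    return selector


selector = classes_to_css("Layout-sc-nxg1ff-0 itdjvg default-panel", tag="div", child="a")
print(selector)  # div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a
```

The resulting string can be passed as the value for By.CSS_SELECTOR. Generated class names such as itdjvg may change when the site is redeployed, so a selector built this way may need updating.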
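- You are using the wrong locator.
- You should use explicit waits with expected conditions instead of hardcoded pauses.
The find_elements method returns a list of web elements, whereas you want the links inside those elements.
This should work better: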
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
wait = WebDriverWait(browser, 20)
# Wait (up to 20 seconds) until a link inside the panels container is clickable
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='channel-panels-container']//a")))
time.sleep(0.5)  # short pause so the remaining links can render
# find_elements_by_xpath was removed in Selenium 4.3; use find_elements(By.XPATH, ...)
link_blocks = browser.find_elements(By.XPATH, "//div[@class='channel-panels-container']//a")
for link_block in link_blocks:
    link = link_block.get_attribute("href")
    print(link)
browser.close()
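To illustrate why an explicit wait is preferable to a fixed time.sleep, here is a minimal sketch of the polling idea behind it. wait_until is a hypothetical stand-in written for this example, not Selenium's WebDriverWait; the real class also takes the driver and can ignore selected exceptions while polling.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout elapses.

    Hypothetical illustration of an explicit wait; returns the truthy value.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return as soon as the condition holds, without waiting out the full timeout
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a browser session open, something like wait_until(lambda: browser.find_elements(By.CSS_SELECTOR, "div.default-panel > a")) would return the list as soon as it is non-empty. A fixed sleep either waits longer than necessary or gives up before the panels have rendered.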