Python > Selenium: Web-scraping in a "logged-in" environment based on links from a text file

Compatible with ChromeDriver

The program aims to do the following:

  1. Automatically sign in to a website;
  2. Visit one or more links from a text file;
  3. Scrape data from each page visited this way; and
  4. Output all scraped data via print().

Please skip to Part 2, the problem area, since Part 1 has already been tested and works for step 1. :)

Code:

Part 1

from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://www.website1.com/home")

main_page = driver.current_window_handle 
time.sleep(5) 

##cookies
driver.find_element_by_xpath('//*[@id="CybotCookiebotDialogBodyButtonAccept"]').click() 
time.sleep(5)

driver.find_element_by_xpath('//*[@id ="google-login"]/span').click() 
for handle in driver.window_handles: 
    if handle != main_page: 
        login_page = handle 

driver.switch_to.window(login_page) 

with open('logindetails.txt', 'r') as file:
    for details in file:
        email, password = details.split(':')

        driver.find_element_by_xpath('//*[@id ="identifierId"]').send_keys(email)
        driver.find_element_by_xpath('//span[text()="Next"]').click()

        time.sleep(5)
        driver.find_element_by_xpath('//input[@type="password"]').send_keys(password)
        driver.find_element_by_xpath('//span[text()="Next"]').click()

driver.switch_to.window(main_page)
time.sleep(5)
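One fragile spot in the loop above is details.split(':'): each line read from logindetails.txt still ends with a newline, and a password containing ':' would make the two-value unpacking raise a ValueError. A small sketch of a safer parse (the parse_credentials name is my own, not from the original code), splitting on the first ':' only:

```python
def parse_credentials(line):
    """Split an 'email:password' line on the first ':' only,
    so passwords containing ':' survive, and strip the trailing newline."""
    email, password = line.strip().split(':', 1)
    return email, password

# e.g. inside the loop: email, password = parse_credentials(details)
```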

Part 2

In alllinks.txt, we have the following websites:


• website1.com/otherpage/page1
• website1.com/otherpage/page2
• website1.com/otherpage/page3

with open('alllinks.txt', 'r') as directory:
    for items in directory:
        driver.get(items)
        time.sleep(2)
        elements = driver.find_elements_by_class_name('data-xl')
        for element in elements:
            print([element])
        time.sleep(5)


driver.quit()

Result:

[Done] exited with code=0 in 53.463 seconds

...and zero output


Question:

The location of the element has been verified, so I suspect the window handling has something to do with why the driver is not scraping.

All input is welcome and much appreciated. :)

Answer:

The URL passed to driver.get() must include the protocol (i.e. https://); driver.get('website1.com/otherpage/page1') will simply raise an exception.
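Since the links in alllinks.txt are stored without a scheme, one way to guard against this is to normalize each line before handing it to driver.get(). A minimal sketch (the normalize_url name is my own):

```python
def normalize_url(line):
    """Strip whitespace/newlines from a line read from the links file
    and prepend https:// when no scheme is present."""
    url = line.strip()
    if not url.startswith(('http://', 'https://')):
        url = 'https://' + url
    return url

# e.g. inside the loop: driver.get(normalize_url(items))
```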

It turns out I had missed the "iframe", which matters for elements that cannot be selected directly through the window.

iframe = driver.find_element_by_xpath("//iframe[@class='LegacyFeature__Iframe-tasygr-1 bFBhBT']")
driver.switch_to.frame(iframe)
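Note that 'LegacyFeature__Iframe-tasygr-1 bFBhBT' is actually two CSS classes in one attribute, and auto-generated class names like these can change between site builds, so matching the full @class string is brittle. A contains()-based XPath that matches a single class token is a common, more robust alternative; the class_xpath helper below is my own, not part of the original answer:

```python
def class_xpath(tag, css_class):
    """Build an XPath matching `tag` elements that carry `css_class`
    as one token of their class attribute (not the whole string)."""
    return ("//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
            "' {cls} ')]").format(tag=tag, cls=css_class)

# e.g. driver.find_element_by_xpath(class_xpath('iframe', 'bFBhBT'))
```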

After switching to the target iframe, we then run the code to find and print the elements we are looking for.

time.sleep(1)

elements = driver.find_elements_by_class_name('data-xl')
for element in elements:
    print(element.text)

Once logged in, you can direct the webdriver pretty much anywhere else on the site, even based on a text file containing all the links of interest:

Suppose that the text file (shown below as "LINKS.txt") had the following links:
https://www.website.com/home/item1
https://www.website.com/home/item2
https://www.website.com/home/item3

with open('LINKS.txt', 'r') as directory:
    for items in directory:
        driver.get(items)
        iframe = driver.find_element_by_xpath("//iframe[@class='LegacyFeature__Iframe-tasygr-1 bFBhBT']")
        driver.switch_to.frame(iframe)
        time.sleep(10)
        elements = driver.find_elements_by_class_name('data-xl')
        for element in elements:
            print(element.text)

time.sleep(10)

The code above should let you visit the pages ...item1, ...item2, ...item3 (per the ".txt" file), scrape the elements, and print the output.