Python > Selenium: Web-scraping in a "logged-in" environment based on links from a text file
Compatible with ChromeDriver
The program aims to accomplish the following:
- Automatically sign in to a website;
- Visit one or more links from a text file;
- Scrape data from each page visited this way; and
- Output all scraped data via print().
Please skip ahead to Part 2 for the problem area, since Part 1 has already been tested and works for step 1. :)
Code:
Part 1
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.website1.com/home")
main_page = driver.current_window_handle
time.sleep(5)

# accept the cookie banner
driver.find_element_by_xpath('//*[@id="CybotCookiebotDialogBodyButtonAccept"]').click()
time.sleep(5)

# open the Google sign-in pop-up window
driver.find_element_by_xpath('//*[@id="google-login"]/span').click()

# grab the handle of the newly opened sign-in window
for handle in driver.window_handles:
    if handle != main_page:
        login_page = handle

driver.switch_to.window(login_page)

# read "email:password" from file and complete the two-step Google sign-in
with open('logindetails.txt', 'r') as file:
    for details in file:
        email, password = details.split(':')
        driver.find_element_by_xpath('//*[@id="identifierId"]').send_keys(email)
        driver.find_element_by_xpath('//span[text()="Next"]').click()
        time.sleep(5)
        driver.find_element_by_xpath('//input[@type="password"]').send_keys(password)
        driver.find_element_by_xpath('//span[text()="Next"]').click()

driver.switch_to.window(main_page)
time.sleep(5)
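A side note (a sketch of an alternative, assuming the same element locators as above): the fixed time.sleep(5) calls in Part 1 can be replaced with explicit waits, which tolerate pages that load at varying speeds:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)
# wait until the cookie banner button is clickable instead of sleeping a fixed 5 s
wait.until(EC.element_to_be_clickable(
    (By.ID, "CybotCookiebotDialogBodyButtonAccept"))).click()
wait.until(EC.element_to_be_clickable(
    (By.XPATH, '//*[@id="google-login"]/span'))).click()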
Part 2
In alllinks.txt, we have the following links:
• website1.com/otherpage/page1
• website1.com/otherpage/page2
• website1.com/otherpage/page3
with open('alllinks.txt', 'r') as directory:
    for items in directory:
        driver.get(items)
        time.sleep(2)
        elements = driver.find_elements_by_class_name('data-xl')
        for element in elements:
            print([element])
        time.sleep(5)

driver.quit()
Result:
[Done] exited with code=0 in 53.463 seconds
... and zero output.
Question:
The location of the element has been verified; I suspect the window handles have something to do with why the driver is not scraping.
All input is welcome and greatly appreciated. :)
Answer:
The URL used in driver.get() must include the protocol (i.e. https://). driver.get('website1.com/otherpage/page1') will just raise an exception.
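As a minimal sketch (an addition, assuming the links in alllinks.txt are stored without a scheme, as shown above), each line can be normalized before it is passed to driver.get(); stripping the trailing newline the file iterator keeps is also necessary:

with open('alllinks.txt', 'r') as directory:
    for line in directory:
        url = line.strip()  # drop the trailing newline
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url  # assumption: the pages are served over https
        driver.get(url)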
It turns out I had missed the "iframe", which matters for elements that cannot be selected directly from the window.
iframe = driver.find_element_by_xpath("//iframe[@class='LegacyFeature__Iframe-tasygr-1 bFBhBT']")
driver.switch_to.frame(iframe)
After switching to the target iframe, we then run the code to find and print the elements we are looking for.
time.sleep(1)
elements = driver.find_elements_by_class_name('data-xl')
for element in elements:
    print(element.text)
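If the fixed sleep proves brittle, a sketch of an alternative (not part of the original answer) is to wait explicitly until the frame is available and switch into it in one step:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 s for the iframe to appear, then switch the driver into it
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it(
    (By.XPATH, "//iframe[@class='LegacyFeature__Iframe-tasygr-1 bFBhBT']")))
elements = driver.find_elements_by_class_name('data-xl')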
Once logged in, you can direct the webdriver to practically any other page on the site, even based on a text file containing all the links of interest:
Suppose that the text file (shown below as "LINKS.txt") had the following links:
• https://www.website.com/home/item1
• https://www.website.com/home/item2
• https://www.website.com/home/item3
with open('LINKS.txt', 'r') as directory:
    for items in directory:
        driver.get(items)
        iframe = driver.find_element_by_xpath("//iframe[@class='LegacyFeature__Iframe-tasygr-1 bFBhBT']")
        driver.switch_to.frame(iframe)
        time.sleep(10)
        elements = driver.find_elements_by_class_name('data-xl')
        for element in elements:
            print(element.text)
        time.sleep(10)
The code above should allow you to visit pages ...item1, ...item2, and ...item3 (per the ".txt" file), scrape the elements, and print the output.
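One defensive tweak worth noting (my assumption, not part of the original answer): strip the newline from each line before navigating, and switch back to the top-level document at the end of each iteration so the next lookup is never accidentally scoped to the previous frame:

with open('LINKS.txt', 'r') as directory:
    for items in directory:
        driver.get(items.strip())  # strip the newline kept by the file iterator
        iframe = driver.find_element_by_xpath(
            "//iframe[@class='LegacyFeature__Iframe-tasygr-1 bFBhBT']")
        driver.switch_to.frame(iframe)
        time.sleep(10)
        for element in driver.find_elements_by_class_name('data-xl'):
            print(element.text)
        driver.switch_to.default_content()  # return to the top-level document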