How to web scrape with Python using Selenium, Bs4 & Docx, with multiple URLs as input and multiple output Docx files?
I have been looking at a few different solutions for scraping multiple URLs with Selenium, BS4 and Docx. So far I have been able to scrape one URL, extract exactly the content I want, and export that output to a single docx file. It is only when it comes to more than one URL that I run into trouble.
Currently I am using the code below to scrape the content.
I want to create a loop that scrapes, to start with, just two web pages (or several URLs), and once I have worked out how to loop through those I can append the rest of the URLs I have to the list.
I would like to export each URL's content/output to its own separate docx file.
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import requests
import time
import docx
import os
doc = docx.Document()
url =["https://www.udemy.com/course/python-the-complete-python-developer-course/",
"https://www.udemy.com/course/the-creative-html5-css3-course-build-awesome-websites/"]
#output = [1,2]
list1 = url
for item in url:
try:
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get(item)
except:
# if the link cant be scraped
break
time.sleep(5)
button = driver.find_element_by_xpath("//div/div[@class='curriculum--sub-header--23ncD']/button[@class='udlite-btn udlite-btn-medium udlite-btn-ghost udlite-heading-sm']")
button.click()
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html,'html.parser')
main = soup.find_all('div',{'class' : 'section--panel--1tqxC panel--panel--3NYBX'})
for mains in main:
header = mains.find_all("span",{'class' : 'section--section-title--8blTh'})
for title in header:
outputtitle = title.text
doc.add_heading(outputtitle,1)
for titles in header:
sub = mains.find_all('div',{'class' : 'section--row--3PNBT'})
for a in sub:
sub1 = a.find_all("span")
for sub in sub1:
outputsub = sub.text
doc.add_heading(outputsub,3)
for i in range(len(list1)):
doc.save("file%s.docx" %i)
Create a list that will store the links:
links = []
Loop through them using a try/except statement:
for item in links:
    try:
        # open item
        driver.get(item)
    except:
        # if the link cant be scraped
        break

    # scrape the link
    with open(f'{item.replace(".", "").replace("/", "")}', 'w') as file:
        file.write(scraped_info)
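Putting that idea together with the scraping logic from the question, a fuller sketch could look like the following: it creates a fresh docx.Document() for each URL and saves it under its own numbered file name. The chromedriver path, the Udemy class names and the old find_element_by_xpath call are carried over from the question's Selenium version; treat them as assumptions that may need adjusting for your own setup rather than a verified implementation.

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import docx

# chromedriver path taken from the question -- adjust for your machine
PATH = r"C:\Program Files (x86)\chromedriver.exe"

urls = [
    "https://www.udemy.com/course/python-the-complete-python-developer-course/",
    "https://www.udemy.com/course/the-creative-html5-css3-course-build-awesome-websites/",
]

driver = webdriver.Chrome(PATH)  # reuse one browser for every URL

for i, item in enumerate(urls):
    doc = docx.Document()  # a fresh document for each URL
    try:
        driver.get(item)
        time.sleep(5)
        # "Expand all sections" button -- class names copied from the question
        # and may have changed on the live site
        driver.find_element_by_xpath(
            "//div/div[@class='curriculum--sub-header--23ncD']"
            "/button[@class='udlite-btn udlite-btn-medium udlite-btn-ghost udlite-heading-sm']"
        ).click()
        time.sleep(5)
    except Exception:
        # if this link can't be scraped, skip it instead of stopping everything
        continue

    soup = BeautifulSoup(driver.page_source, "html.parser")
    for panel in soup.find_all("div", {"class": "section--panel--1tqxC panel--panel--3NYBX"}):
        for title in panel.find_all("span", {"class": "section--section-title--8blTh"}):
            doc.add_heading(title.text, 1)
        for row in panel.find_all("div", {"class": "section--row--3PNBT"}):
            for span in row.find_all("span"):
                doc.add_heading(span.text, 3)

    doc.save("file%s.docx" % i)  # file0.docx, file1.docx, ...

driver.quit()

Using continue instead of break means one bad URL is skipped rather than aborting the whole run, and reusing a single driver instance avoids launching a new Chrome window per course; both are optional tweaks to the original approach, not requirements.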