Selenium and multiprocessing: collects duplicate pages and does not collect certain pages
I have a problem with the code below: it collects duplicate pages and does not collect certain pages.
In the example URL the pagination has 19 pages; for example, it collects the comments of page 2, collects those same comments again for page 3, and never collects page 3's own comments.
import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
import time

def get_comments(url):
    browser = webdriver.Firefox(executable_path='geckodriver')
    browser.get(url)
    soup = bs(browser.page_source, "html.parser")
    lastPage = soup.findAll('span', class_='page')[-1].text
    for page in range(1, int(lastPage) + 1):
        print(page)
        wait(browser, 10).until(EC.element_to_be_clickable(
            (By.XPATH, "//span[text()='" + str(page) + "']"))).click()
        soup = bs(browser.page_source, "html.parser")
        comments = soup.findAll('div', class_='commentaire-card-left')
        for comment in comments:
            print(comment.find('p').text)
            print(comment.find('cite').text)

if __name__ == '__main__':
    url = "https://www.mesopinions.com/petition/politique/stop-massacre-nos-artisans-annulez-redressement/74954/page14?commentaires-list=true"
    ThreadPool(10).map(get_comments, [url])
Thanks a lot.
collects duplicate pages and does not collect certain pages
Could you give a concrete example? I ran it and looked at the output, but I don't see the problem.
how can I open 5 tabs with selenium for 5 urls
Opening 5 windows is easier than opening 5 tabs.
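For instance, a minimal sketch of that idea (the URLs here are placeholders, not from the question) that gives each of 5 URLs its own Firefox window through a thread pool:

from multiprocessing.pool import ThreadPool
from selenium import webdriver

def scrape(url):
    # each worker thread drives its own browser window
    browser = webdriver.Firefox(executable_path="geckodriver")
    try:
        browser.get(url)
        return browser.title
    finally:
        browser.quit()

urls = ["https://example.com/%d" % i for i in range(1, 6)]  # placeholder URLs
titles = ThreadPool(5).map(scrape, urls)  # 5 windows, one per URL
print(titles)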
This call:
ThreadPool(10).map(get_comments, [url])
does not do what you expect. map does something like this:
# executed in parallel
function(args[0]) # possibly in first executor
function(args[1]) # possibly in second executor
...
In your case the args list has only one element, so you are using only 1 of the 10 executors.
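A minimal illustration with a toy function (not the scraper) shows that map only spreads work across the pool when it receives several arguments:

from multiprocessing.pool import ThreadPool

def work(n):
    return n * n

# four tasks spread over the pool: work(1), work(2), ... run in parallel
print(ThreadPool(10).map(work, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# whereas .map(work, [1]) keeps 9 of the 10 workers idle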
To fix this, you have to define a function that parses a specific range of pages, or a single page:
import math
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait

def print_comments(url, pages):
    # one browser per worker; each worker handles its own range of pages
    browser = webdriver.Firefox(executable_path="geckodriver")
    browser.get(url)
    for page in pages:
        wait(browser, 10).until(EC.element_to_be_clickable(
            (By.XPATH, "//span[text()='" + str(page) + "']"))).click()
        soup = bs(browser.page_source, "html.parser")
        comments = soup.findAll("div", class_="commentaire-card-left")
        for comment in comments:
            print(comment.find("p").text)
            print(comment.find("cite").text)
    browser.quit()

def print_all_comments(url):
    # open the page once just to read the total number of pages
    browser = webdriver.Firefox(executable_path="geckodriver")
    browser.get(url)
    soup = bs(browser.page_source, "html.parser")
    browser.quit()
    lastPage = soup.findAll("span", class_="page")[-1].text
    # split the pages into 3 roughly equal ranges, one per worker
    step = math.ceil(int(lastPage) / 3)
    with ThreadPoolExecutor(max_workers=3) as executor:
        for start_page in range(1, int(lastPage) + 1, step):
            page_numbers = list(range(start_page, start_page + step))
            executor.submit(print_comments, url, page_numbers)

url = "https://www.mesopinions.com/petition/politique/stop-massacre-nos-artisans-annulez-redressement/74954/page14?commentaires-list=true"
print_all_comments(url)
The implementation is not perfect, but you should get the idea :)
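As a side note on the duplicates themselves: clicking a page number returns before the new comments have rendered, so page_source can still hold the previous page, which would explain page 2 showing up twice. A hedged sketch of one guard against that, assuming the clicked pagination span is replaced on navigation (which I have not verified against this site), is to wait for the old element to go stale:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait

def click_and_wait(browser, page):
    # locate the pagination element for the target page and click it
    old = wait(browser, 10).until(EC.element_to_be_clickable(
        (By.XPATH, "//span[text()='" + str(page) + "']")))
    old.click()
    # block until the clicked element is detached from the DOM, i.e.
    # the next batch of comments has actually replaced the old one
    wait(browser, 10).until(EC.staleness_of(old))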