Selenium and multiprocessing: collects duplicate pages and does not collect certain pages

I have a problem with the code below: it collects duplicate pages and skips others. The example URL has 19 pages of pagination; for instance, it collects the comments of page 2, then collects those same comments again in place of page 3, so the comments of page 3 are never collected.

import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup as bs 
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
import time

def get_comments(url):
    browser = webdriver.Firefox(executable_path='geckodriver')
    browser.get(url)
    soup = bs(browser.page_source,"html.parser")

    lastPage = soup.findAll('span', class_= 'page')[-1].text

    for page in range(1,int(lastPage)+1):
        print(page)
        wait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//span[text()='" + str(page) + "']"))).click()
        soup = bs(browser.page_source,"html.parser")
        comments = soup.findAll('div', class_ = 'commentaire-card-left')
        for comment in comments:
            print(comment.find('p').text)
            print(comment.find('cite').text)

if __name__ == '__main__':
    url = "https://www.mesopinions.com/petition/politique/stop-massacre-nos-artisans-annulez-redressement/74954/page14?commentaires-list=true"
    ThreadPool(10).map(get_comments, [url])

Thank you very much.

collects duplicate pages and does not collect certain pages

Can you give a concrete example? I looked at the output and did not spot the problem.

How can I open 5 tabs with Selenium for 5 URLs?

It is easier to open 5 separate windows (one webdriver instance per URL) than to drive 5 tabs in a single browser; see the sketch below.
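
A minimal sketch of the one-window-per-URL idea, assuming you have five independent URLs to scrape; parse_one_url and the example URLs are placeholders, not taken from the code above:

from multiprocessing.pool import ThreadPool
from selenium import webdriver

def parse_one_url(url):
    # each call gets its own driver, i.e. its own browser window
    browser = webdriver.Firefox(executable_path="geckodriver")
    try:
        browser.get(url)
        # ... scrape the page here ...
    finally:
        browser.quit()

urls = [
    "https://example.com/1",
    "https://example.com/2",
    "https://example.com/3",
    "https://example.com/4",
    "https://example.com/5",
]

# 5 URLs -> up to 5 windows open in parallel, one per executor
with ThreadPool(5) as pool:
    pool.map(parse_one_url, urls)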

This:

ThreadPool(10).map(get_comments, [url])

does not do what you expect. map does something like this:

# executed in parallel
function(args[0]) # possibly in first executor
function(args[1]) # possibly in second executor
...

In your case the args list has only one element, so you only ever use 1 of the 10 executors.
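
A minimal, Selenium-free illustration of that point (the work function is just a placeholder):

from multiprocessing.pool import ThreadPool

def work(item):
    # placeholder task: returns the square of its argument
    return item * item

with ThreadPool(10) as pool:
    # a single-element list -> work(3) runs once, on one executor,
    # while the other 9 executors stay idle
    print(pool.map(work, [3]))        # -> [9]

    # several elements -> the calls are spread across the executors
    print(pool.map(work, [1, 2, 3]))  # -> [1, 4, 9]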

To fix this, define a function that parses a specific range of pages (or a single page):

import math
from concurrent.futures import ThreadPoolExecutor
# (the selenium and BeautifulSoup imports from the question are still needed)


def print_comments(url, pages):
    browser = webdriver.Firefox(executable_path="geckodriver")
    browser.get(url)
    for page in pages:
        wait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//span[text()='" + str(page) + "']"))).click()
        soup = bs(browser.page_source, "html.parser")
        comments = soup.findAll("div", class_="commentaire-card-left")
        for comment in comments:
            print(comment.find("p").text)
            print(comment.find("cite").text)
    browser.quit()


def print_all_comments(url):
    browser = webdriver.Firefox(executable_path="geckodriver")
    browser.get(url)
    soup = bs(browser.page_source, "html.parser")
    browser.quit()

    lastPage = soup.findAll("span", class_="page")[-1].text
    step = math.ceil(int(lastPage) / 3)
    with ThreadPoolExecutor(max_workers=3) as executor:
        for start_page in range(1, int(lastPage) + 1, step):
            # cap the last chunk so we never click a page number beyond lastPage
            page_numbers = list(range(start_page, min(start_page + step, int(lastPage) + 1)))
            executor.submit(print_comments, url, page_numbers)


url = "https://www.mesopinions.com/petition/politique/stop-massacre-nos-artisans-annulez-redressement/74954/page14?commentaires-list=true"
print_all_comments(url)
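
For the 19 pages mentioned in the question, the split works out like this (a standalone check of the arithmetic, using the same step formula as above):

import math

lastPage = 19
step = math.ceil(lastPage / 3)   # 7 pages per worker
for start_page in range(1, lastPage + 1, step):
    print(list(range(start_page, min(start_page + step, lastPage + 1))))
# [1, 2, 3, 4, 5, 6, 7]
# [8, 9, 10, 11, 12, 13, 14]
# [15, 16, 17, 18, 19]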

The implementation is not perfect, but you should get the idea :)