What about multithreading/multiprocessing in BeautifulSoup and Python 3?

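So I'm playing around with BeautifulSoup. I wrote some code and, with your permission, I'm posting it here. My question: is there a way to speed it up using multithreading or multiprocessing? I bet this code is far from ideal :) Should Pool be used in this case?

P.S. I used this website as an example.

Thanks in advance.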

import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://statesassembly.gov.je/Pages/Members.aspx?MemberID='


def get_page_data(html):
    # Parse one member page and pull out the name and title.
    soup = BeautifulSoup(html, 'lxml')
    name = soup.find('h1').text
    title = soup.find(class_='gel-layout__item gel-2/3@m gel-1/1@s').find('h2').text
    return {'name': name, 'title': title}


pages = [str(i) for i in range(100, 2000)]
# Open the CSV once; newline='' is what the csv module expects in Python 3.
with open('Members.csv', 'a', newline='', encoding='utf-8') as output_file:
    writer = csv.writer(output_file, delimiter=';')
    # Pages are fetched one after another, which is the slow part.
    for page in pages:
        html = requests.get(BASE_URL + page).text
        data = get_page_data(html)
        writer.writerow((data['name'], data['title']))

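Brute-forcing a government website may be illegal in some countries. Make sure you have read the copyright laws of your own country and of the country you are pulling data from.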

First, split your list into several parts, then have threads process those parts in parallel.
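As a rough sketch of that idea, the snippet below splits a list of page IDs into roughly equal chunks and hands each chunk to its own thread. The names `split_into_chunks`, `process_chunk` and `run_in_threads` are made up for illustration, and the worker only builds a placeholder string where a real one would download and parse the page.

```python
import threading


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk, results, lock):
    """Hypothetical worker: a real one would fetch and parse each page."""
    for page in chunk:
        processed = 'page-' + page  # placeholder for the real work
        with lock:  # protect the shared results list
            results.append(processed)


def run_in_threads(pages, n_threads=4):
    """Start one thread per chunk and wait for all of them to finish."""
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=process_chunk, args=(chunk, results, lock))
        for chunk in split_into_chunks(pages, n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    pages = [str(i) for i in range(100, 110)]
    print(sorted(run_in_threads(pages)))
```

The results arrive in whatever order the threads finish, so sort them or key them by page ID if the order matters.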

A Python program illustrating the concept of threading:

import threading 
import os 

def task1(): 
    print("Task 1 assigned to thread: {}".format(threading.current_thread().name)) 
    print("ID of process running task 1: {}".format(os.getpid())) 

def task2(): 
    print("Task 2 assigned to thread: {}".format(threading.current_thread().name)) 
    print("ID of process running task 2: {}".format(os.getpid())) 

if __name__ == "__main__": 

    # print ID of current process 
    print("ID of process running main program: {}".format(os.getpid())) 

    # print name of main thread 
    print("Main thread name: {}".format(threading.main_thread().name)) 

    # creating threads 
    t1 = threading.Thread(target=task1, name='t1') 
    t2 = threading.Thread(target=task2, name='t2')   

    # starting threads 
    t1.start() 
    t2.start() 

    # wait until all threads finish 
    t1.join() 
    t2.join()
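
For the scraping loop in the question, a thread pool may be more convenient than managing threads by hand, since that kind of work is mostly spent waiting on the network. Below is a minimal sketch using `concurrent.futures.ThreadPoolExecutor` from the standard library. `scrape_all` and `fake_fetch` are hypothetical names, and the fetch function is passed in as an argument so that a real one (for example, one wrapping `requests.get` and the parsing code from the question) can be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(pages, fetch, max_workers=8):
    """Run fetch(page) for every page on a pool of worker threads.

    executor.map yields results in the same order as the input pages.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, pages))


def fake_fetch(page):
    """Stand-in for a real download-and-parse function (an assumption)."""
    return {'name': 'member ' + page, 'title': 'title ' + page}


if __name__ == '__main__':
    rows = scrape_all([str(i) for i in range(100, 105)], fake_fetch)
    for row in rows:
        print(row['name'], row['title'])
```

Collecting the rows first and writing the CSV from the main thread afterwards avoids having several threads write to the same file at once.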