Concurrency multithreading with requests
I am trying to figure out how to create concurrent requests with multithreading while using the requests library. I want to grab the links and the total number of pages from the POST request of a url.
However, I am iterating over a very large loop, so it takes very long. What I have tried does not seem to make the requests concurrent, nor does it produce any output.
Here is what I have tried:
#smaller subset of my data
df = {'links': ['https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D687',
                'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D492',
                'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D499',
                'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D702',
                'https://www.theparking.eu/used-cars/#!/used-cars/%3Fid_pays%3D6%26id_marque%3D138%26id_modele%3D6143'],
      'make': [138.0, 138.0, 138.0, 138.0, 138.0],
      'model': [687.0, 492.0, 499.0, 702.0, 6143.0],
      'country_id': [6.0, 6.0, 6.0, 6.0, 6.0]}

import requests
import json                          # needed for json.loads below
from collections import defaultdict  # needed for formal_data below
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import threading
import gc

def get_links(url):
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    formal_data = defaultdict(list)
    for id_ in df['country_id']:
        for make in df['make']:
            for model in df['model']:
                data = {
                    'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
                    'tabs': '["t0"]'
                }
                response = requests.post(url, headers=headers, data=data)
                test = json.loads(response.text)
                pages = round(int(test['context']['nb_results'])/27)
                if pages != 0:
                    formal_data['total_pages'].append(pages)
                    formal_data['links'].append(url)
                    print(f'You are on this link:{url}')
    return formal_data

threadLocal = threading.local()

with ThreadPool(8) as pool:
    urls = df['links']
    pool.map(get_links, urls)

# must be done before terminate is explicitly or implicitly called on the pool:
del threadLocal
gc.collect()
So one thing to note is that for programs like this that hit a web API, the work is I/O-bound (the performance cost here is waiting on requests to another machine/server/etc.), and the more general approach is asynchronous programming. A good async HTTP request library is httpx (there are others as well). You'll find the interface of these libraries similar to requests, and they let you work either async or sync, so it should be an easy transition. From there you'll want to learn about async programming in Python; the httpx quickstart and async guides, as well as other good tutorials on general Python async programming, can be found through Google.

You can see that this is the approach taken by other Python HTTP wrapper libraries, e.g. asyncpraw.

Also, a quick note on why async is preferable to multiprocessing here:

- async essentially allows a single process/thread to execute other parts of the program while some parts are waiting for output, so it basically feels as if all the code is executing in parallel
- multiprocessing actually launches separate processes (I'm paraphrasing a bit, but that's the gist) and will likely not get the same performance gains as async for this kind of workload.

Note that a more modern approach to using requests asynchronously is to use other libraries, such as requests-threads.
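As a rough sketch of what the httpx version could look like (this is an illustration, not the poster's code: the headers and ajax form data would be the same dicts as in the question, and the connection limit of 8 is an arbitrary choice):

import asyncio
import httpx

async def get_pages(client, url, data):
    # same POST as before, but awaiting lets other requests run while this one waits
    response = await client.post(url, data=data)
    return url, response.json()

async def main(urls):
    # a single shared client reuses connections across all requests
    limits = httpx.Limits(max_connections=8)
    async with httpx.AsyncClient(limits=limits) as client:
        tasks = [get_pages(client, url, {}) for url in urls]  # pass the real form data here
        return await asyncio.gather(*tasks)

# results = asyncio.run(main(df['links']))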
With your approach, you connect to the different URLs in parallel, but to each individual URL sequentially. So you are probably not getting the most out of the multithreading; in fact, for a single URL in df['links'] you would get the same result as with a single thread. The easiest way to avoid this is itertools.product, which generates an iterator over what would otherwise be nested loops.
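As a quick illustration of what product does (standard library behavior, shown with made-up values):

from itertools import product

# one tuple per combination, in the same order the nested loops would visit them
pairs = list(product(['url_a', 'url_b'], [6.0], [138.0]))
# pairs == [('url_a', 6.0, 138.0), ('url_b', 6.0, 138.0)]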
import requests
from concurrent.futures import ThreadPoolExecutor as ThreadPool
from itertools import product

# ... snipped df definition ...

def get_links(packed_pars):
    url, id_, make, model = packed_pars
    headers = {
        'authority': 'www.theparking.eu',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="97","Chromium";v="97"',
        'accept': '*/*',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'origin': 'https://www.theparking.eu',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.theparking.eu/used-cars/used-cars/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    data = {
        'ajax': '{"tab_id":"t0","cur_page":1,"cur_trie":"distance","query":"","critere":{"id_pays":[%s],"id_marque":[%s], "id_modele":[%s]},"sliders":{"prix":{"id":"#range_prix","face":"prix","max_counter":983615,"min":"1","max":"400000"},"km":{"id":"#range_km","face":"km","max_counter":1071165,"min":"1","max":"500000"},"millesime":{"id":"#range_millesime","face":"millesime","max_counter":1163610,"min":"1900","max":"2022"}},"req_num":1,"nb_results":"11795660","current_location_distance":-1,"logged_in":false}' % (round(id_), round(make), round(model)),
        'tabs': '["t0"]'
    }
    response = requests.post(url, headers=headers, data=data)
    test = response.json()
    pages = round(int(test['context']['nb_results'])/27)
    if pages != 0:
        print(f'You are on this link:{url}, with {pages} pages')
    else:
        print("no pages")
    return url, pages

with ThreadPool(8) as pool:
    rv = pool.map(get_links, product(df['links'], df['country_id'], df['make'],
                                     df['model']))

# This converts rv to the dict of the original post:
formal_data = dict()
filtered_list = [(url, pages) for url, pages in rv if pages]
if filtered_list:
    formal_data['links'], formal_data['total_pages'] = zip(*filtered_list)
else:  # Protect against empty answers
    formal_data['links'], formal_data['total_pages'] = [], []
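A note on pool.map: ThreadPoolExecutor.map yields results in input order. If you would rather handle each response as soon as it finishes, an equivalent sketch using submit and as_completed (with the same get_links as above) would be:

from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product

with ThreadPoolExecutor(8) as pool:
    futures = [pool.submit(get_links, pars)
               for pars in product(df['links'], df['country_id'], df['make'], df['model'])]
    for future in as_completed(futures):  # yields each future as it completes
        url, pages = future.result()      # re-raises any exception from the worker
        print(url, pages)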
As for why this produces no output: in the end, with the data supplied in the question, test['context']['nb_results'] is 0 every time. Even with the full dataset, it is very likely that your query returns zero items every time.
Some other comments:

- multiprocessing.pool.ThreadPool is not recommended: you should use concurrent.futures.ThreadPoolExecutor instead.
- You are not using threadLocal at all: it can be deleted. I don't know what you would have used it for anyway.
- You are importing threading but not using it.
- requests responses have a json method that parses the text directly: there is no need to import json in this case.
- You most likely want ceil rather than round for the number of pages (see the sketch after this list).
- Since you are waiting on I/O, it is fine to use more threads than available cores.
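For the ceil point, a one-line sketch (keeping the 27 results per page from the original code):

from math import ceil

# with 28 results: round(28/27) == 1 loses the second, partially filled page,
# while ceil(28/27) == 2 keeps it
pages = ceil(int(test['context']['nb_results']) / 27)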