python 中的线程池没有预期的那么快

Question

我是 python 和机器学习的初学者。我正在尝试使用多线程重现 countvectorizer() 的代码。我正在使用 yelp 数据集使用 LogisticRegression 进行情绪分析。这是我到目前为止写的：

代码片段：

from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial
data = df['text']
rev = df['stars'] 


y = []
def product_helper(args):
    return featureExtraction(*args)


def featureExtraction(p,t):     
    temp = [0] * len(bag_of_words)
    for word in p.split():
        if word in bag_of_words:
            temp[bag_of_words.index(word)] += 1

    return temp


# function to be mapped over
def calculateParallel(threads): 
    pool = ThreadPool(threads)
    job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
    l = pool.map(product_helper,job_args)
    pool.close()
    pool.join()
    return l

temp_X = calculateParallel(12)

这里只是部分代码。

解释：

df['text'] 包含所有评论，df['stars'] 包含评级（1 到 5）。我正在尝试使用多线程查找字数向量 temp_X。 bag_of_words 是一些常用词的列表。

问题：

在没有多线程的情况下，我能够在大约 24 分钟内计算出 temp_X，而上面的代码对于大小为 100k 的评论的数据集花费了 33 分钟。我的机器有 128GB 的 DRAM 和 12 个内核（6 个具有超线程的物理内核，即每个内核的线程数 = 2）。

我做错了什么？

Answer 1

Python 线程并不能很好地协同工作。有一个称为 GIL（全局交互锁）的已知问题。基本上，interperter 中有一个锁，使所有线程都不会运行并行（即使你有多个 cpu 核心）。 Python 将简单地给每个线程几毫秒的 cpu 时间一个接一个（它变慢的原因是这些线程之间的上下文切换的开销）。

这是一份非常好的文档，解释了它是如何工作的：http://www.dabeaz.com/python/UnderstandingGIL.pdf

为了解决您的问题，我建议您尝试多处理： https://pymotw.com/2/multiprocessing/basics.html

注意：多处理并非 100% 等同于多线程。多进程将运行并行，但不同的进程不会共享内存，因此如果您更改其中一个进程中的变量，则不会在另一个进程中更改。

Answer 2

您的整个代码似乎是 CPU Bound 而不是 IO Bound。您只是在使用 GIL 下的 threads 如此有效运行仅一个线程plus overheads.It 运行s only on one core.To 运行 on multiple cores use

使用

import multiprocessing
pool = multiprocessing.Pool()
l = pool.map_async(product_helper,job_args)

from multiprocessing.dummy import Pool as ThreadPool 只是 thread 的包装器 module.It 仅使用 one core 而不是更多。

python 中的线程池没有预期的那么快

Threadpool in python is not as fast as expected

python

multithreading

machine-learning

feature-extraction