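Why is my numpy code with threading not parallel?
I need to perform some calculations on rasters (matrices) covering the neighborhoods of several points. My idea is to do these calculations in parallel threads and then sum up the resulting rasters. My problem is that the execution does not seem to run in parallel. When I double the number of points, the execution takes twice as long. What am I doing wrong?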
from threading import Lock, Thread
import numpy as np
import time

SIZE = 1000000
THREADS = 8
my_lock=Lock()
results = np.zeros(SIZE,dtype=np.float64)

def do_job(j):
    global results
    s_time = time.time()
    print("Starting... "+str(j))
    #do some calculations
    c_r=np.zeros(SIZE,dtype=np.float64)
    for i in range(SIZE):
        c_r[i]=np.exp(-0.001*i)
    print("\t Calculation at job "+str(j)+" lasted: {:3.3f}".format(time.time()-s_time))
    #sum up the results
    if my_lock.acquire(blocking=True):
        results = np.add(results,c_r)
        my_lock.release()
    print("\t Job "+str(j)+" lasted: {:3.3f}".format(time.time()-s_time))

def main():
    global THREADS
    s_time = time.time()
    threads=[]
    while THREADS>0:
        p = Thread(target=do_job,args=(THREADS,))
        threads.append(p)
        p.start()
        THREADS = THREADS-1
    print("Start finished after : {:3.3f}".format(time.time()-s_time))
    for p in threads:
        p.join()
    print("Total run diuration: {:3.3f}".format(time.time()-s_time))

if __name__ == "__main__":
    main()
When I run the code with THREADS=4, I get:
Starting... 4
Starting... 3
Starting... 2
Starting... 1
Start finished after : 0.069
Calculation at job 4 lasted: 5.805
Job 4 lasted: 5.887
Calculation at job 3 lasted: 6.230
Job 3 lasted: 6.237
Calculation at job 1 lasted: 6.585
Job 1 lasted: 6.595
Calculation at job 2 lasted: 6.737
Job 2 lasted: 6.738
Total run diuration: 6.760
When I switch to THREADS = 8, the execution time roughly doubles:
Starting... 8
Starting... 7
Starting... 6
Starting... 5
Starting... 4
Starting... 3
Starting... 1
Start finished after : 0.182
Starting... 2
Calculation at job 7 lasted: 11.883
Job 7 lasted: 11.939
Calculation at job 8 lasted: 13.096
Job 8 lasted: 13.144
Calculation at job 1 lasted: 13.548
Job 1 lasted: 13.576
Calculation at job 3 lasted: 13.723
Job 3 lasted: 13.748
Calculation at job 2 lasted: 14.231
Job 2 lasted: 14.268
Calculation at job 5 lasted: 14.698
Job 5 lasted: 14.708
Calculation at job 4 lasted: 15.000
Job 4 lasted: 15.015
Calculation at job 6 lasted: 15.133
Job 6 lasted: 15.135
Total run diuration: 15.136
You are running into the Global Interpreter Lock (GIL); see https://wiki.python.org/moin/GlobalInterpreterLock.
Only one thread at a time can execute inside the interpreter.
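Your code spends most of its time inside the for i in range(SIZE) loop, which is executed by the Python interpreter. A thread gives up the GIL when it blocks on I/O, when it calls a C function that releases the GIL, or, in CPython 3, when the interpreter forces a switch after its periodic switch interval. The threads therefore take turns instead of running simultaneously. In addition, switching between threads is expensive compared with the small amount of work each thread does per iteration. That is why adding more threads slows execution down.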
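To see the effect in isolation, here is a minimal sketch, assuming a standard CPython build with the GIL enabled. The helper names (count_down, timed, run_sequential, run_threaded) are made up for illustration. It times the same pure-Python loop run back to back and run in threads; on a GIL build the two timings usually come out close, but the exact numbers depend on your machine and Python version.

```python
import sys
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; the thread holds the GIL while it runs."""
    while n > 0:
        n -= 1


def timed(fn):
    """Return the wall-clock seconds taken by calling fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_sequential(n, jobs):
    # Run the jobs one after another in the main thread.
    for _ in range(jobs):
        count_down(n)


def run_threaded(n, jobs):
    # Run the same jobs in separate threads and wait for all of them.
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    N, JOBS = 1_000_000, 4
    # How long a thread may keep the GIL before a switch is requested.
    print("switch interval (s):", sys.getswitchinterval())
    print("sequential: {:.3f}".format(timed(lambda: run_sequential(N, JOBS))))
    print("threaded:   {:.3f}".format(timed(lambda: run_threaded(N, JOBS))))
```

If the threaded timing is not clearly lower than the sequential one, the loop is bound by the GIL rather than by the number of cores.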
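According to the numpy documentation, many operations release the GIL, so if you vectorize your operations and make the program spend more of its time inside numpy, you can benefit from threads.
See post: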
Try changing:
for i in range(SIZE):
    c_r[i]=np.exp(-0.001*i)
to:
c_r = np.exp(-0.001*np.arange(SIZE))
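Putting the change into the threaded program, a sketch could look like the following. The function name threaded_sum and its parameters are made up for illustration, and the state is kept local instead of in globals. It assumes np.exp and np.add release the GIL for float64 arrays; how much speedup you actually get depends on your numpy build and on the array size, so measure it on your own data.

```python
from threading import Lock, Thread

import numpy as np


def threaded_sum(size, n_threads):
    """Sum n_threads vectorized rasters into one array, one thread per raster."""
    results = np.zeros(size, dtype=np.float64)
    lock = Lock()

    def do_job():
        # One vectorized call replaces the Python-level loop over i.
        c_r = np.exp(-0.001 * np.arange(size))
        # Only the accumulation into the shared array needs the lock.
        with lock:
            np.add(results, c_r, out=results)

    threads = [Thread(target=do_job) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    total = threaded_sum(1_000_000, 8)
    print(total[:3])
```

The in-place np.add(..., out=results) updates the shared array instead of rebinding the name, so no global statement is needed, and the with block releases the lock even if the addition raises.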