What is some example code for demonstrating multicore speedup in Python on Windows?
I'm on Python 3 on Windows, trying to build a toy example that demonstrates how to use multiple CPU cores to speed up a computation. The toy example is rendering the Mandelbrot fractal.

So far:
- I avoided threading, because the Global Interpreter Lock rules out multicore speedup in this context (a minimal sketch of this follows the list)
- I discarded example code that doesn't run on Windows because it relies on Linux's fork()
- I'm trying the "multiprocessing" package. I declare p=Pool(8) (8 being my number of cores) and use p.starmap(..) to delegate the work. This should spawn several "subprocesses", which Windows automatically dispatches to different CPUs
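A minimal sketch of the threading point above (my own illustration, not code from the fractal renderer): pure-Python, CPU-bound work under CPython's GIL, where four threads take about as long as one:

import threading
import time

def burn(n):
    # pure-Python busy loop; only one thread at a time can execute CPython bytecode
    while n > 0:
        n -= 1

if __name__ == "__main__":
    start = time.perf_counter()
    threads = [threading.Thread(target=burn, args=(10_000_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # roughly the same wall time as burn(40_000_000) run serially
    print("4 threads: {:.2f} seconds".format(time.perf_counter() - start))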
However, I can't demonstrate any speedup, whether due to overhead or to no actual multiprocessing taking place. Pointers to toy examples with demonstrable speedup would therefore be very helpful :-)
EDIT: Thanks! This pushed me in the right direction, and I now have a working example that demonstrates a doubling of speed on a CPU with 4 cores.

A copy of my code with "lecture notes" is here: https://pastebin.com/c9HZ2vAV

I settled on Pool(), but will later try the "Process" alternative pointed out by @16num (a sketch of it follows the code below). Here is the Pool() code example:
from multiprocessing import Pool, cpu_count
from functools import partial

p = Pool(cpu_count())
# starmap unpacks each tuple into positional arguments, but it cannot pass an
# extra fixed argument along; "partial" binds dataarray as a workaround
partial_calculatePixel = partial(calculatePixel, dataarray=data)
koord = []
for j in range(height):
    for k in range(width):
        koord.append((j, k))
# Runs the calls to calculatePixel in a pool. "hmm" collects the output
hmm = p.starmap(partial_calculatePixel, koord)
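For reference, a minimal sketch of the "Process" alternative mentioned above (a hypothetical worker that just cubes numbers, not the code from the pastebin):

from multiprocessing import Process, Queue

def worker(chunk, out_queue):
    # each process cubes its own chunk and reports the results back
    out_queue.put([n ** 3 for n in chunk])

if __name__ == "__main__":
    out_queue = Queue()
    chunks = [range(0, 5000), range(5000, 10000)]  # split the work in two
    procs = [Process(target=worker, args=(c, out_queue)) for c in chunks]
    for proc in procs:
        proc.start()
    # drain the queue before join(), otherwise a full pipe can deadlock the children
    results = [out_queue.get() for _ in procs]
    for proc in procs:
        proc.join()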
Demonstrating multiprocessing speedup is quite simple:
import multiprocessing
import time

# multi-platform precision clock (time.clock was removed in Python 3.8;
# time.perf_counter measures precisely on Windows and Linux alike)
get_timer = time.perf_counter
def cube_function(num):
    time.sleep(0.01)  # simulate that it takes the CPU core ~10 ms to cube the number
    return num ** 3

if __name__ == "__main__":  # multiprocessing guard, required on Windows
    # we'll test multiprocessing with pools from one up to the number of CPU cores
    # on the system; it won't show significant improvements after that, and it will
    # soon start going downhill due to the underlying OS thread context switches
    for workers in range(1, multiprocessing.cpu_count() + 1):
        pool = multiprocessing.Pool(processes=workers)
        # let's 'warm up' our pool so it doesn't affect our measurements
        pool.map(cube_function, range(multiprocessing.cpu_count()))
        # now to business: we'll have 10000 numbers to cube via our expensive function
        print("Cubing 10000 numbers over {} processes:".format(workers))
        timer = get_timer()  # time measuring starts now
        results = pool.map(cube_function, range(10000))  # map our range to cube_function
        timer = get_timer() - timer  # get our delta time as soon as it finishes
        print("\tTotal: {:.2f} seconds".format(timer))
        print("\tAvg. per process: {:.2f} seconds".format(timer / workers))
        pool.close()  # let's clear out our pool for the next run
        time.sleep(1)  # wait a second to make sure everything is cleaned up
Of course, we are only simulating a 10 ms computation per number here; for a real-world demonstration you can substitute anything CPU-taxing for cube_function. The results are as expected:
Cubing 10000 numbers over 1 processes:
    Total: 100.01 seconds
    Avg. per process: 100.01 seconds
Cubing 10000 numbers over 2 processes:
    Total: 50.02 seconds
    Avg. per process: 25.01 seconds
Cubing 10000 numbers over 3 processes:
    Total: 33.36 seconds
    Avg. per process: 11.12 seconds
Cubing 10000 numbers over 4 processes:
    Total: 25.00 seconds
    Avg. per process: 6.25 seconds
Cubing 10000 numbers over 5 processes:
    Total: 20.00 seconds
    Avg. per process: 4.00 seconds
Cubing 10000 numbers over 6 processes:
    Total: 16.68 seconds
    Avg. per process: 2.78 seconds
Cubing 10000 numbers over 7 processes:
    Total: 14.32 seconds
    Avg. per process: 2.05 seconds
Cubing 10000 numbers over 8 processes:
    Total: 12.52 seconds
    Avg. per process: 1.57 seconds
Now, why isn't it 100% linear? Well, first of all, it takes some time to map/distribute the data to the subprocesses and to get it back, context switching has a cost, other tasks use my CPUs from time to time, and time.sleep() isn't exactly precise (nor can it be on a non-RT OS)... but the results are roughly in the range expected for parallel processing.
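One easy way to trim that distribution overhead: Pool.map already batches its input, but its chunksize parameter can be set explicitly to trade scheduling granularity against IPC round trips. A one-line variation of the measured call above, assuming the same pool and cube_function:

# dispatch 500 numbers per worker round trip instead of the default chunk size
results = pool.map(cube_function, range(10000), chunksize=500)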