Big difference in execution time for first and subsequent run of cupy functions
When I run a CuPy function on a CuPy array, the first call takes noticeably longer than the second run, even when the second call operates on a different array.
Why is that?
import cupy as cp
from time import time

cp.__version__
# 7.5.0

A = cp.random.random((1024, 1024))
B = cp.random.random((1024, 1024))

def test(func, *args):
    t = time()
    func(*args)
    print("{}".format(round(time() - t, 4)))
test(cp.fft.fft2, A)
test(cp.fft.fft2, B)
# 0.129
# 0.001
test(cp.matmul, A, A.T)
test(cp.matmul, B, B.T)
# 0.171
# 0.0
test(cp.linalg.inv, A)
test(cp.linalg.inv, B)
# 0.259
# 0.002
CuPy compiles its kernels just-in-time the first time a function is used within a Python process, and that compilation takes time.
From the CuPy documentation:
CuPy uses on-the-fly kernel synthesis: when a kernel call is required,
it compiles a kernel code optimized for the shapes and dtypes of given
arguments, sends it to the GPU device, and executes the kernel. The
compiled code is cached to $(HOME)/.cupy/kernel_cache directory (this
cache path can be overwritten by setting the CUPY_CACHE_DIR
environment variable). It may make things slower at the first kernel
call, though this slow down will be resolved at the second execution.
CuPy also caches the kernel code sent to GPU device within the
process, which reduces the kernel transfer time on further calls.
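As the quoted passage notes, the on-disk cache location can be redirected through the `CUPY_CACHE_DIR` environment variable before the process starts. A minimal sketch (the directory shown is an arbitrary example, not a CuPy default):

```shell
# Redirect CuPy's kernel cache; /tmp/cupy_kernel_cache is just an
# illustrative path. Set this before launching the Python process.
export CUPY_CACHE_DIR=/tmp/cupy_kernel_cache
```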
And according to the CuPy user guide:
Context Initialization:
It may take several seconds when calling a
CuPy function for the first time in a process. This is because the
CUDA driver creates a CUDA context during the first CUDA API call in
CUDA applications.
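One more caveat about the numbers in the question: CuPy kernel launches are asynchronous, so timing without an explicit device synchronization mostly measures launch overhead rather than execution time. A hedged sketch of a warm-up-plus-synchronize timing pattern follows; the NumPy fallback is only there so the snippet also runs on machines without a GPU, and is not part of the recommended CuPy workflow:

```python
from time import perf_counter

try:
    import cupy as xp

    def sync():
        # Wait for all queued GPU work to finish before reading the clock.
        xp.cuda.Device().synchronize()
except ImportError:  # CPU-only fallback so the sketch stays runnable
    import numpy as xp

    def sync():
        pass

def timed(func, *args):
    sync()                      # exclude previously queued work
    t0 = perf_counter()
    out = func(*args)
    sync()                      # include this call's kernels in the timing
    return out, perf_counter() - t0

A = xp.random.random((1024, 1024))
res, first = timed(xp.fft.fft2, A)    # first call: JIT/context/plan cost
res, second = timed(xp.fft.fft2, A)   # subsequent call: caches are warm
print("first: {:.4f}s, second: {:.4f}s".format(first, second))
```

In practice this means benchmarks should discard (or report separately) the first call, which pays the one-time compilation and context-creation cost the answer describes.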