Can I allocate more memory than necessary with cudaMalloc to avoid reallocating?

Question

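I'm writing code that uses cuSparse to do calculations on thousands of sparse matrices on the GPU. Because GPU memory is limited, I need to process them one at a time; the rest of the memory is taken up by other GPU variables and dense matrices.

My workflow (in pseudocode) is as follows: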

for (i = 0; i < 1000; i++) {
    // allocate sparse matrix using cudaMalloc
    // copy sparse matrix from host using cudaMemcpy
    // do calculation by calling cuSparse
    // deallocate sparse matrix with cudaFree
}

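Here I allocate and free memory for each sparse matrix in every iteration, because their sparsity differs and so the memory each one requires differs as well.

Can I do the following instead: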

// allocate buffer once at the beginning using cudaMalloc, with some extra space
// so that even the sparse matrix with the highest density fits
for (i = 0; i < 1000; i++) {
    // copy sparse matrix from host using cudaMemcpy to the same buffer
    // do calculation by calling cuSparse
}
// free the buffer once at the end using cudaFree

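The above avoids having to malloc and free the buffer in each iteration. Would the above work? Would it improve performance? Is it good practice, or is there a better way to do this?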

Answer 1

The above avoids having to malloc and free the buffer in each iteration. Would the above work?

In principle, yes.
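
One way to size such a buffer is to scan the matrices on the host first and take the largest footprint. The sketch below assumes CSR storage with 32-bit indices and double-precision values, which the question does not specify, and pads each array to 256 bytes so that arrays placed back to back stay aligned. `CsrShape`, `align_up`, `csr_bytes`, and `max_csr_bytes` are hypothetical names, not part of CUDA or cuSparse.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one sparse matrix (CSR layout is an assumption).
struct CsrShape {
    std::size_t rows;  // number of rows
    std::size_t nnz;   // number of stored non-zeros
};

// Round n up to a multiple of align (align must be a power of two).
inline std::size_t align_up(std::size_t n, std::size_t align) {
    return (n + align - 1) & ~(align - 1);
}

// Bytes for row offsets, column indices, and values, each padded to `align`.
// Assumes int32 indices and double values.
std::size_t csr_bytes(const CsrShape& m, std::size_t align = 256) {
    std::size_t row_ptr = align_up((m.rows + 1) * sizeof(std::int32_t), align);
    std::size_t col_idx = align_up(m.nnz * sizeof(std::int32_t), align);
    std::size_t values  = align_up(m.nnz * sizeof(double), align);
    return row_ptr + col_idx + values;
}

// Largest footprint over all matrices: the size to request once, up front.
std::size_t max_csr_bytes(const std::vector<CsrShape>& all) {
    std::size_t worst = 0;
    for (const auto& m : all) worst = std::max(worst, csr_bytes(m));
    return worst;
}
```

The value returned by `max_csr_bytes` would be passed to a single `cudaMalloc` before the loop, and each iteration would then copy its arrays into that buffer with `cudaMemcpy`.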

Would it improve performance?

Possibly. Memory allocation and deallocation are not free of latency.

Is it good practice or is there a better way to do this?

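Generally, yes. Many widely used GPU-accelerated frameworks (TensorFlow, for example) use this strategy to reduce the cost of memory management on the GPU. Whether it benefits your use case is something you will need to test yourself.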

Answer 2

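tl;dr: Yes, pre-allocate

I'll be more blunt than @talonmies:

cudaMalloc() and cudaFree() are very slow. They are also unnecessary when you have no other potential contenders for GPU memory: just "use it all" by allocating as much as you expect you might possibly need. Then use a sub-allocator, or an allocator initialized with a given slab, to sub-allocate within it. If a framework you are working with offers this, use it; otherwise, write it yourself or look for a library that does it for you.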
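
As a rough illustration of what "an allocator initialized with a given slab" could look like, here is a minimal bump (linear) sub-allocator. This is a sketch under assumptions, not the allocator of any particular framework: `SlabAllocator` and its methods are hypothetical names, it never frees individual blocks, and it is not thread-safe. It only does pointer arithmetic and never dereferences the slab, so the base pointer may be device memory obtained from one `cudaMalloc` call.

```cpp
#include <cstddef>

// Hypothetical bump sub-allocator over one preallocated slab.
class SlabAllocator {
public:
    SlabAllocator(void* base, std::size_t capacity)
        : base_(static_cast<char*>(base)), capacity_(capacity), offset_(0) {}

    // Returns a pointer inside the slab whose offset from the base is a
    // multiple of `align` (a power of two), or nullptr if it does not fit.
    // The pointer itself is aligned only if the slab base is; the default
    // of 256 assumes a base that is aligned at least that strictly.
    void* allocate(std::size_t bytes, std::size_t align = 256) {
        std::size_t start = (offset_ + align - 1) & ~(align - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        offset_ = start + bytes;
        return base_ + start;
    }

    // Forget all sub-allocations, e.g. at the top of each loop iteration.
    void reset() { offset_ = 0; }

    // Bytes consumed so far, including alignment padding.
    std::size_t used() const { return offset_; }

private:
    char* base_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

In the loop from the question, one would call `reset()` at the start of each iteration, `allocate()` once per array of the current sparse matrix, and copy into the returned pointers; the slab itself is released with a single `cudaFree` after the loop.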
