Strange cuBLAS gemm batched performance
I have noticed some strange performance of cublasSgemmStridedBatched and I am looking for an explanation. The matrix size is fixed at 20x20. Here are some timings (multiplication only, no data transfer) for a few different batch sizes:
- batch = 100, time = 0.2 ms
- batch = 1,000, time = 1.9 ms
- batch = 10,000, time = 18.3 ms
- batch = 100,000, time = 5.3 ms
- batch = 1,000,000, time = 52.8 ms
The first few batch sizes behave as I expected: the time increases roughly linearly as the batch size grows by a factor of ten. Yet at 100,000 matrices there is suddenly a 3.4x speedup?
If the matrix size is fixed at 10x10 and the trial is run again, I find:
- batch = 100, time = 0.2 ms
- batch = 1,000, time = 2.0 ms
- batch = 10,000, time = 20.0 ms
- batch = 100,000, time = 0.9 ms
- batch = 1,000,000, time = 8.9 ms
Again, there is a surprising 22x speedup at a batch size of 100,000. It makes me wonder why batch sizes of 1,000 and 10,000 are slower than a batch size of 100,000, given that the matrix size is still 10x10.
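As a rough back-of-the-envelope check on the numbers above: a 10x10 SGEMM is about 2*10^3 = 2,000 floating-point operations, so a batch of 10,000 in 20.0 ms works out to roughly 1 GFLOP/s, while a batch of 100,000 in 0.9 ms is roughly 220 GFLOP/s. In other words, it is the slower cases that sit far below what the card should be capable of, rather than the faster cases being implausibly fast.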
Is a different algorithm being used for different batch sizes? This performance seems very strange to me. When I run the same trial with cublasSgemmBatched, I see similar results (a sketch of that pointer-array variant is shown after the code below).
These trials were executed on a GeForce GTX 1080 Ti. Minimal working code:
#include <stdio.h>
#include <stdlib.h>
#include "math.h"
#include "cublas_v2.h"
// nvcc -lcublas cublas.c -o cublas.out
int main(int argc, char* argv[])
{
    int i, j, k, index;
    // Linear dimension of matrices
    int dim = 20;
    int batch_count = 10*10*10*10*10*1;  // 100,000
    // Allocate host storage for batch_count A, B, C square matrices
    float* h_A = (float*)malloc(sizeof(float) * dim * dim * batch_count);
    float* h_B = (float*)malloc(sizeof(float) * dim * dim * batch_count);
    float* h_C = (float*)malloc(sizeof(float) * dim * dim * batch_count);
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = i*dim + j + k*dim*dim;
                h_A[index] = index*index + 0.0f;
                h_B[index] = index + 1.0f;
                h_C[index] = 0.0f;
            }
        }
    }
    // Allocate device storage and copy the host data over
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, sizeof(float) * dim * dim * batch_count);
    cudaMalloc(&d_B, sizeof(float) * dim * dim * batch_count);
    cudaMalloc(&d_C, sizeof(float) * dim * dim * batch_count);
    cudaMemcpy(d_A, h_A, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, sizeof(float) * dim * dim * batch_count, cudaMemcpyHostToDevice);
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Do the actual multiplication, timed with CUDA events
    float time_cuda_event;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    float alpha = 1.0f; float beta = 1.0f;
    cublasSgemmStridedBatched(handle,
                              CUBLAS_OP_N,
                              CUBLAS_OP_N,
                              dim, dim, dim,
                              &alpha,
                              (const float*)d_A, dim,
                              dim*dim,
                              (const float*)d_B, dim,
                              dim*dim,
                              &beta,
                              d_C, dim,
                              dim*dim,
                              batch_count);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time_cuda_event, start, stop);
    printf("Time : %3.1f ms \n", time_cuda_event);
    cudaMemcpy(h_C, d_C, sizeof(float) * dim * dim * batch_count, cudaMemcpyDeviceToHost);
    // Destroy the handle and free memory
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
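For reference, here is a minimal sketch of how the cublasSgemmBatched (pointer-array) variant of the same multiply might look. It reuses handle, alpha, beta, dim, batch_count and the d_A/d_B/d_C buffers from the code above; the pointer-array names (h_Aptr, d_Aptr, and so on) are illustrative, not part of the original program:

// cublasSgemmBatched takes arrays of per-matrix device pointers instead of a fixed stride.
float **h_Aptr = (float**)malloc(sizeof(float*) * batch_count);
float **h_Bptr = (float**)malloc(sizeof(float*) * batch_count);
float **h_Cptr = (float**)malloc(sizeof(float*) * batch_count);
for(k=0; k<batch_count; k++) {
    h_Aptr[k] = d_A + k*dim*dim;
    h_Bptr[k] = d_B + k*dim*dim;
    h_Cptr[k] = d_C + k*dim*dim;
}
float **d_Aptr, **d_Bptr, **d_Cptr;
cudaMalloc(&d_Aptr, sizeof(float*) * batch_count);
cudaMalloc(&d_Bptr, sizeof(float*) * batch_count);
cudaMalloc(&d_Cptr, sizeof(float*) * batch_count);
cudaMemcpy(d_Aptr, h_Aptr, sizeof(float*) * batch_count, cudaMemcpyHostToDevice);
cudaMemcpy(d_Bptr, h_Bptr, sizeof(float*) * batch_count, cudaMemcpyHostToDevice);
cudaMemcpy(d_Cptr, h_Cptr, sizeof(float*) * batch_count, cudaMemcpyHostToDevice);
// Same m, n, k and leading dimensions as the strided call, but no strides.
cublasSgemmBatched(handle,
                   CUBLAS_OP_N, CUBLAS_OP_N,
                   dim, dim, dim,
                   &alpha,
                   (const float**)d_Aptr, dim,
                   (const float**)d_Bptr, dim,
                   &beta,
                   d_Cptr, dim,
                   batch_count);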
This just seems to be a result of heuristics within CUBLAS. If I run a modified (and working) version of your code, I get these times for the 5x5 case:
Batch size : 10 Time : 0.019104 ms
Batch size : 100 Time : 0.038304 ms
Batch size : 1000 Time : 0.163520 ms
Batch size : 10000 Time : 1.410944 ms
Batch size : 100000 Time : 1.614144 ms
Batch size : 1000000 Time : 16.057407 ms
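The exact modifications behind those numbers are not shown here, so the following is only an illustrative sketch of that kind of sweep, assuming the buffers, handle and events from the code above were allocated for the largest batch count:

// Illustrative sketch: time the strided-batched GEMM for several batch counts
// against buffers already sized for the largest count.
int counts[] = {10, 100, 1000, 10000, 100000, 1000000};
int c, n;
for(c = 0; c < 6; c++) {
    n = counts[c];
    cudaEventRecord(start, 0);
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              dim, dim, dim,
                              &alpha,
                              (const float*)d_A, dim, dim*dim,
                              (const float*)d_B, dim, dim*dim,
                              &beta,
                              d_C, dim, dim*dim,
                              n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time_cuda_event, start, stop);
    printf("Batch size : %d Time : %f ms\n", n, time_cuda_event);
}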
Profiling shows that for batches of up to 10000 entries, the library runs a single kernel:
1.10759s 16.831us (1 1 10) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [3939]
1.10766s 19.168us (1 1 100) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [3971]
1.10773s 147.71us (1 1 1000) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [4003]
1.10791s 1.4064ms (1 1 10000) (128 1 1) 120 12.250KB 0B - - - - GeForce GTX 970 1 7 maxwell_sgemm_128x64_nn [4035]
At larger sizes it runs multiple launches of a different kernel to service the call (presumably chunked at 65535 because the batch index maps to the grid z-dimension, which is limited to 65535 blocks):
1.10935s 1.1518ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4063]
1.11050s 606.54us (1 1 34465) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4087]
1.11113s 1.1498ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4115]
1.11228s 1.1501ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4139]
1.11344s 1.1511ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4163]
1.11459s 1.1494ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4187]
1.11574s 1.1507ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4211]
1.11689s 1.1503ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4235]
1.11804s 1.1499ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4259]
1.11919s 1.1507ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4283]
1.12035s 1.1507ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4307]
1.12150s 1.1509ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4331]
1.12265s 1.1489ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4355]
1.12380s 1.1496ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4379]
1.12495s 1.1500ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4403]
1.12610s 1.1494ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4427]
1.12726s 1.1503ms (1 1 65535) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4451]
1.12841s 299.35us (1 1 16975) (16 16 1) 31 2.1250KB 0B - - - - GeForce GTX 970 1 7 void batch_gemm_kernel1x1_core<float, float, float, bool=0, bool=0, bool=0, bool=0, bool=0, bool=1, bool=1>(float* const *, float const * const *, float const * const *, float*, float const *, float const *, int, int, int, int, int, int, __int64, __int64, __int64, float const *, float const *, float, float, int, int) [4475]
The inconsistency you are observing appears to be caused by the library switching from one kernel to another, presumably based on some batch-size criterion. You can see that both kernels seem to use one block per batch item, with the kernel used at larger batch sizes running a 2D block of 256 threads and the kernel used at smaller batch sizes running a 1D block of 128 threads. Beyond that, the performance differences come down to internal implementation details. Even though doing so probably violates the end-user licence, if you want to understand more you will need to disassemble the kernels and look at how they work. The toolkit contains all the tools required to do this, although I am not suggesting you actually do so.
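If you only want to see which kernels cuBLAS picks for each batch size, rather than disassemble them, a per-launch GPU trace from the profiler is enough. On toolkits that still ship nvprof, a command like the one below produces a trace in the same format as the tables above; on newer toolkits, Nsight Systems (nsys) or Nsight Compute (ncu) provide the equivalent information:

nvprof --print-gpu-trace ./cublas.out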