CUBLAS Transpose matrix multiplication problem
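I am trying to compute C = A^T * B with CUBLAS. The problem is that with the code I have (which I took from this), some matrix dimensions seem to work fine, for example int rows_a = 1, cols_a = 200, rows_b = 1, cols_b = 200. With other dimensions the values come out wrong, for example int rows_a = 200, cols_a = 5, rows_b = 200, cols_b = 5;.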
In my code I set up two matrices and multiply them with the CUBLAS function cublasSgemm. Afterwards I do the same matrix multiplication on the CPU to check that the result is correct.
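#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// printMatriz was not shown in the post; assumed here to print a row-major rows x cols matrix
void printMatriz(const float *m, int rows, int cols)
{
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
            printf("%f ", m[i * cols + j]);
        printf("\n");
    }
}

int main(int argc, char *argv[])
{
    cublasHandle_t handle; // the declaration was missing from the post
    cublasCreate(&handle);
    int rows_a = 200, cols_a = 5, rows_b = 200, cols_b = 5;
    float al = 1.0f;
    float bet = 0.0f;
    float *a = (float *)malloc(rows_a * cols_a * sizeof(float));
    float *b = (float *)malloc(rows_b * cols_b * sizeof(float));
    float *c = (float *)malloc(cols_a * cols_b * sizeof(float));   // CUBLAS result
    float *cpu = (float *)malloc(cols_a * cols_b * sizeof(float)); // CPU result
    for (int i = 0; i < rows_a * cols_a; i++)
    {
        a[i] = i;
    }
    for (int i = 0; i < rows_b * cols_b; i++)
    {
        b[i] = i * 4;
    }
    float *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, rows_a * cols_a * sizeof(float));
    cudaMalloc((void **)&dev_b, rows_b * cols_b * sizeof(float));
    cudaMalloc((void **)&dev_c, cols_a * cols_b * sizeof(float));
    cudaMemcpy(dev_a, a, rows_a * cols_a * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, rows_b * cols_b * sizeof(float), cudaMemcpyHostToDevice);
    // cuBLAS is column-major, so the row-major buffers are seen as A^T and B^T.
    // The call computes C^T = B^T * A (cols_b x cols_a), which read back row-major is C = A^T * B.
    // ldc is the row count of C^T, i.e. cols_b; the post passed cols_a, which only works when cols_a == cols_b.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, cols_b, cols_a, rows_b, &al,
                dev_b, cols_b, dev_a, cols_a, &bet, dev_c, cols_b);
    cudaMemcpy(c, dev_c, cols_a * cols_b * sizeof(float), cudaMemcpyDeviceToHost);
    printMatriz(c, cols_a, cols_b);
    // CPU reference: C[i][j] = sum over k of A[k][i] * B[k][j]
    for (int i = 0; i < cols_a; i++)
    {
        for (int j = 0; j < cols_b; j++)
        {
            float v = 0;
            for (int k = 0; k < rows_a; k++)
            {
                v += a[(cols_a * k) + i] * b[(cols_b * k) + j];
            }
            cpu[(i * cols_b) + j] = v;
        }
    }
    printMatriz(cpu, cols_a, cols_b);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a); free(b); free(c); free(cpu);
    cublasDestroy(handle);
    return 0;
}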
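The cublasSgemm call above depends on cuBLAS reading memory as column-major: the row-major buffers are seen as A^T and B^T, and the call produces C^T = B^T * A. In that view C^T has cols_b rows, so its leading dimension is cols_b, which happens to equal cols_a for the dimensions used here. The argument mapping can be checked without a GPU using a rough CPU sketch like the one below. `sgemm_nt_colmajor` and `atb_rowmajor` are hypothetical helpers written for this illustration, not cuBLAS functions.

```c
#include <assert.h>

/* Hypothetical CPU stand-in for cublasSgemm with CUBLAS_OP_N, CUBLAS_OP_T:
 * C = alpha * A * B^T + beta * C, all matrices column-major.
 * A is m x k (leading dimension lda), B is n x k (ldb), C is m x n (ldc). */
static void sgemm_nt_colmajor(int m, int n, int k, float alpha,
                              const float *A, int lda,
                              const float *B, int ldb,
                              float beta, float *C, int ldc)
{
    assert(lda >= m && ldb >= n && ldc >= m);
    for (int col = 0; col < n; col++) {
        for (int row = 0; row < m; row++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[row + p * lda] * B[col + p * ldb];
            /* When beta is 0 the old contents of C are not read. */
            float old = (beta == 0.0f) ? 0.0f : beta * C[row + col * ldc];
            C[row + col * ldc] = alpha * acc + old;
        }
    }
}

/* Row-major reference for C = A^T * B, same loop as the CPU check in the question. */
static void atb_rowmajor(const float *a, const float *b, float *c,
                         int rows, int cols_a, int cols_b)
{
    for (int i = 0; i < cols_a; i++) {
        for (int j = 0; j < cols_b; j++) {
            float v = 0.0f;
            for (int k = 0; k < rows; k++)
                v += a[cols_a * k + i] * b[cols_b * k + j];
            c[i * cols_b + j] = v;
        }
    }
}
```

Calling `sgemm_nt_colmajor(cols_b, cols_a, rows_b, 1.0f, b, cols_b, a, cols_a, 0.0f, c, cols_b)` on the host arrays fills `c` with the same row-major layout that `atb_rowmajor` produces, including when cols_a and cols_b differ.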
Incorrect output:
(cublas)
264670000.000000 265068000.000000 265466000.000000 265864000.000000 266262000.000000
265068000.000000 265466800.000000 265865600.000000 266264400.000000 266663200.000000
...
(cpu)
264669856.000000 265068016.000000 265466144.000000 265864000.000000 266261856.000000
265068016.000000 265466656.000000 265865584.000000 266264544.000000 266663184.000000
...
I expected the two results to be identical, so clearly my implementation is wrong. Can anyone help me? Thanks!
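I think you are just running into floating-point precision; the values differ only in their last few bits. For example, in hex notation:
265068000 is 0x1.f993bcp+27
265068016 is 0x1.f993bep+27
Note that only the last hex digit changes, and only by 2 (0xf993bc vs. 0xf993be). A float has 23 fraction bits, so the lowest bit of that last digit is always 0, which means these two values are adjacent floats: one unit in the last place apart (16 at this magnitude). After accumulating 200 terms, that is very good agreement.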
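One way to check this is to measure the gap between two printed values in units in the last place (ULPs). The sketch below is plain C and assumes IEEE 754 binary32 floats; `ulp_distance` is a hypothetical helper written for this illustration, not part of CUDA or cuBLAS.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance from x to y counted in ULP steps.
 * Assumes IEEE 754 binary32 and positive, finite inputs, for which the
 * bit patterns are ordered the same way as the values themselves. */
static int32_t ulp_distance(float x, float y)
{
    assert(x > 0.0f && x <= FLT_MAX);
    assert(y > 0.0f && y <= FLT_MAX);

    int32_t ix, iy;
    memcpy(&ix, &x, sizeof ix); /* copy the raw bits, avoiding aliasing problems */
    memcpy(&iy, &y, sizeof iy);
    return ix > iy ? ix - iy : iy - ix;
}
```

With values from the output above, `ulp_distance(265068000.0f, 265068016.0f)` is 1 and `ulp_distance(264670000.0f, 264669856.0f)` is 9.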
Keep in mind that a 32-bit float typically holds about 7 significant decimal digits, while a 64-bit double holds about 15.
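To see which side drifted, the reference can be accumulated in double. For the data in the question (a[i] = i, b[i] = 4i, 200 x 5), the first element works out by hand to 100 * (0^2 + 1^2 + ... + 199^2) = 264670000, which is exactly what cuBLAS printed; the float CPU loop is the one that is off, by 144. The sketch below uses hypothetical helper names (`atb_elem_float`, `atb_elem_double`, `nearly_equal`), and the tolerance passed to `nearly_equal` is an assumption to tune for the problem size.

```c
#include <assert.h>

/* Element (i, j) of C = A^T * B for row-major A (rows x cols_a) and
 * B (rows x cols_b), accumulated in float as in the question's CPU loop. */
static float atb_elem_float(const float *a, const float *b,
                            int rows, int cols_a, int cols_b, int i, int j)
{
    float v = 0.0f;
    for (int k = 0; k < rows; k++)
        v += a[cols_a * k + i] * b[cols_b * k + j];
    return v;
}

/* Same element, accumulated in double to keep rounding error out of the reference. */
static double atb_elem_double(const float *a, const float *b,
                              int rows, int cols_a, int cols_b, int i, int j)
{
    double v = 0.0;
    for (int k = 0; k < rows; k++)
        v += (double)a[cols_a * k + i] * (double)b[cols_b * k + j];
    return v;
}

/* Relative comparison: true when |x - y| <= rel_tol * max(|x|, |y|). */
static int nearly_equal(float x, float y, float rel_tol)
{
    assert(rel_tol >= 0.0f);
    float diff = x > y ? x - y : y - x;
    float ax = x < 0.0f ? -x : x;
    float ay = y < 0.0f ? -y : y;
    float scale = ax > ay ? ax : ay;
    return diff <= rel_tol * scale;
}
```

A check such as `nearly_equal(c[i], cpu[i], 1e-5f)` over all elements would accept both outputs shown in the question, whereas an exact `==` comparison would not.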