CuBlas Matrix Addition using axpy

Question

I am trying to do matrix addition with Alea CuBlas axpy, but it seems to add only the top row:

let matrixAddition (a: float[,]) (b: float[,]) =
    use mA = gpu.AllocateDevice(a)
    use mB = gpu.AllocateDevice(b)
    blas.Axpy(a.Length, 1., mA.Ptr, 1, mB.Ptr, 1)
    Gpu.Copy2DToHost(mB)

Answer 1

I took your example and it works fine for me.

Code:

        var gpu = Gpu.Default;
        var blas = Blas.Get(Gpu.Default);

        var hostA = new float[,]
        {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9},
        };

        var hostB = new float[,]
        {
            {10, 20, 30},
            {40, 50, 60},
            {70, 80, 90},
        };

        PrintArray(hostA);
        PrintArray(hostB);

        var deviceA = gpu.AllocateDevice(hostA);
        var deviceB = gpu.AllocateDevice(hostB);

        blas.Axpy(deviceA.Length, 1f, deviceA.Ptr, 1, deviceB.Ptr, 1);

        var hostC = Gpu.Copy2DToHost(deviceB);

        PrintArray(hostC);

Print helper:

    private static void PrintArray(float[,] array)
    {
        for (var i = 0; i < array.GetLength(0); i++)
        {
            for (var k = 0; k < array.GetLength(1); k++)
            {
                Console.Write("{0} ", array[i, k]);
            }

            Console.WriteLine();
        }

        Console.WriteLine(new string('-', 10));
    }

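This is what I get:

[screenshot of the console output]

Two questions:
- Which version of AleaGpu are you using?
- Which version of the CUDA Toolkit are you using?

I wrote my example against Alea 3.0.4-beta2, and I have CUDA Toolkit 8.0 installed.

Just to be sure, I also tried writing your sample code in F#. (I am not fluent in F#.)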

Code:

let gpu = Gpu.Default
let blas = Blas.Get(Gpu.Default)

let hostA: float[,] = array2D [[  1.0;  2.0;  3.0 ]; [  4.0;  5.0;  6.0 ]; [  7.0;  8.0;  9.0 ]]
let hostB: float[,] = array2D [[ 10.0; 20.0; 30.0 ]; [ 40.0; 50.0; 60.0 ]; [ 70.0; 80.0; 90.0 ]]

PrintArray(hostA)
PrintArray(hostB)

use deviceA = gpu.AllocateDevice(hostA)
use deviceB = gpu.AllocateDevice(hostB)

blas.Axpy(deviceA.Length, 1.0, deviceA.Ptr, 1, deviceB.Ptr, 1)

let hostC = Gpu.Copy2DToHost(deviceB)

PrintArray(hostC)

Print helper:

let PrintArray(array: float[,]): unit =
    for i in 0 .. array.GetLength(0) - 1 do
        for k in 0 .. array.GetLength(1) - 1 do
            Console.Write("{0} ", array.[i, k])
        Console.WriteLine()

    Console.WriteLine(new string('-', 10))

Answer 2

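There is an important difference between JokingBear's code and redb's code.

In this line of the problematic code

blas.Axpy(a.Length,1.,mA.Ptr,1,mB.Ptr,1)

a has type float[,], so a.Length is the number of elements in the matrix a.

However, the corrected code uses this:

blas.Axpy(deviceA.Length, 1f, deviceA.Ptr, 1, deviceB.Ptr, 1);

Here deviceA is no longer a float[,] but a DeviceMemory2D object.

DeviceMemory2D.Length is surprisingly much larger than (float[,]).Length (384 for a 3x3 matrix on my hardware). The allocation on the GPU seems to take up considerably more space, for reasons I do not know.

The key reason JokingBear's code sums only the first row is that (float[,]).Length is too short for the data structure in GPU memory, which is longer. It has nothing to do with the version of Alea.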
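
One plausible explanation for the larger length, offered as an assumption rather than something confirmed by the Alea documentation: 2D device allocations are often "pitched", meaning each row is padded out to an alignment boundary. A length of 384 would fit a 3x3 matrix whose rows are padded to 128 elements each (3 x 128 = 384). The CPU-only sketch below simulates such a layout. It makes no GPU or Alea calls, and the pitch of 128 as well as the names PitchedAxpyDemo, ToPitched, FromPitched, Axpy and AddWithCount are hypothetical, chosen only for illustration. It shows why an element count of 9 would reach only the first row, while the full padded length reaches all three.

```csharp
using System;

public static class PitchedAxpyDemo
{
    // Hypothetical row pitch in elements (assumption: 3 rows * 128 = 384).
    public const int Pitch = 128;

    // Copy a 2D array into a flat buffer whose rows are padded to Pitch elements.
    public static float[] ToPitched(float[,] m)
    {
        int rows = m.GetLength(0), cols = m.GetLength(1);
        var buf = new float[rows * Pitch];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                buf[r * Pitch + c] = m[r, c];
        return buf;
    }

    // Read the logical rows x cols matrix back out of the padded buffer.
    public static float[,] FromPitched(float[] buf, int rows, int cols)
    {
        var m = new float[rows, cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                m[r, c] = buf[r * Pitch + c];
        return m;
    }

    // Plain CPU axpy with unit stride: y[i] += alpha * x[i] for the first n elements.
    public static void Axpy(int n, float alpha, float[] x, float[] y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    // Add a to b using either the host element count or the padded buffer length.
    public static float[,] AddWithCount(bool usePaddedLength)
    {
        var a = new float[,] { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 9 } };
        var b = new float[,] { { 10, 20, 30 }, { 40, 50, 60 }, { 70, 80, 90 } };
        float[] x = ToPitched(a), y = ToPitched(b);
        int n = usePaddedLength ? y.Length : a.Length; // 384 vs 9
        Axpy(n, 1f, x, y);
        return FromPitched(y, 3, 3);
    }

    public static void Main()
    {
        // n = 9 stays inside the first padded row: rows 2 and 3 keep b's values.
        var partial = AddWithCount(false);
        Console.WriteLine("{0} {1}", partial[0, 0], partial[1, 0]); // 11 40

        // n = 384 covers every padded row: all three rows are summed.
        var full = AddWithCount(true);
        Console.WriteLine("{0} {1}", full[0, 0], full[1, 0]);       // 11 44
    }
}
```

In terms of the question's function, this corresponds to passing mA.Length instead of a.Length as the first argument, which is what the first answer's code does with deviceA.Length.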
