Same code, different behavior with C#, AleaGPU and device memory

Question

I'm using the AleaGPU library to perform matrix multiplication and similar operations, and I can't understand why my code isn't working as expected.

By "not working as expected" I mean that the first row (or the first few rows) of the resulting matrix has the right values, while all the remaining rows are filled with 0s, even though I use the same code as in the other samples below.

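Function #1 (not working): for some reason this function doesn't work and shows the behavior described above. It looks as if I had mixed up an index, but I can't see any difference from the code of the other examples below, and I'm not getting any kind of error (AleaGPU usually throws an exception when you try to access an invalid array position).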

public static double[,] Multiply([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentOutOfRangeException("Invalid matrices sizes");

    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);

    // Execute the multiplication in parallel
    using (DeviceMemory2D<double> m1_device = Gpu.Default.AllocateDevice(m1))
    using (DeviceMemory2D<double> m2_device = Gpu.Default.AllocateDevice(m2))
    using (DeviceMemory2D<double> mresult_device = Gpu.Default.AllocateDevice<double>(h, w))
    {
        // Pointers setup
        deviceptr<double>
            pm1 = m1_device.Ptr,
            pm2 = m2_device.Ptr,
            pmresult = mresult_device.Ptr;

        // Local wrapper function
        void Kernel(int ki)
        {
            // Calculate the current indexes
            int
                i = ki / w,
                j = ki % w;

            // Perform the multiplication
            double sum = 0;
            int im1 = i * l;
            for (int k = 0; k < l; k++)
            {
                // m1[i, k] * m2[k, j]
                sum += pm1[im1 + k] * pm2[k * w + j];
            }
            pmresult[i * w + j] = sum; // result[i, j]
        }

        // Get the pointers and iterate for each row
        Gpu.Default.For(0, h * w, Kernel);

        // Return the result
        return Gpu.Copy2DToHost(mresult_device);
    }
}

I've spent hours looking at this code and checking every single line, but I really can't see what's wrong with it.

Function #2: this one works fine, but I can't see how it differs from the first one.

public static double[,] MultiplyGpuManaged([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentOutOfRangeException("Invalid matrices sizes");

    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    double[,]
        m1_gpu = Gpu.Default.Allocate(m1),
        m2_gpu = Gpu.Default.Allocate(m2),
        mresult_gpu = Gpu.Default.Allocate<double>(h, w);

    // Execute the multiplication in parallel
    Gpu.Default.For(0, h * w, index =>
    {
        // Calculate the current indexes
        int
            i = index / w,
            j = index % w;

        // Perform the multiplication
        double sum = 0;
        for (int k = 0; k < l; k++)
        {
            sum += m1_gpu[i, k] * m2_gpu[k, j];
        }
        mresult_gpu[i, j] = sum;
    });

    // Free memory and copy the result back
    Gpu.Free(m1_gpu);
    Gpu.Free(m2_gpu);
    double[,] result = Gpu.CopyToHost(mresult_gpu);
    Gpu.Free(mresult_gpu);
    return result;
}

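Function #3: this one works fine too. I wrote it as an additional test to check whether I had messed up the indexes in the first function (apparently they are fine).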

public static double[,] MultiplyOnCPU([NotNull] this double[,] m1, [NotNull] double[,] m2)
{
    // Checks
    if (m1.GetLength(1) != m2.GetLength(0)) throw new ArgumentOutOfRangeException("Invalid matrices sizes");

    // Initialize the parameters and the result matrix
    int h = m1.GetLength(0);
    int w = m2.GetLength(1);
    int l = m1.GetLength(1);
    double[,] result = new double[h, w];
    Parallel.For(0, h * w, index =>
    {
        unsafe
        {
            fixed (double* presult = result, pm1 = m1, pm2 = m2)
            {
                // Calculate the current indexes
                int
                    i = index / w,
                    j = index % w;

                // Perform the multiplication
                double sum = 0;
                int im1 = i * l;
                for (int k = 0; k < l; k++)
                {
                    sum += pm1[im1 + k] * pm2[k * w + j];
                }
                presult[i * w + j] = sum;
            }
        }
    });
    return result;
}

I really don't understand what I'm missing in the first method, or why it doesn't work.

Thanks in advance for your help!

Answer 1

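It turns out the problem was caused by the way the GPU allocates 2D arrays: instead of using a single contiguous block of memory like a standard .NET array, it adds some padding at the end of each row for performance reasons.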

The correct way to address a 2D GPU array is to use the pitch, which is the effective width of each row (columns + padding).

Here is a working code sample that just fills a 2D GPU array and copies it back to the host:

const int size = 10;
double[,] matrix_gpu;
using (DeviceMemory2D<double> m_gpu = Gpu.Default.AllocateDevice<double>(size, size))
{
    deviceptr<double> ptr = m_gpu.Ptr;
    int pitch = m_gpu.PitchInElements.ToInt32();
    Gpu.Default.For(0, size, i =>
    {
        for (int j = 0; j < size; j++)
        {
            ptr[i * pitch + j] = i * size + j;
        }
    });
    matrix_gpu = Gpu.Copy2DToHost(m_gpu);
}
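
To show how the same idea applies to the multiplication in the question, here is a minimal CPU-only sketch that does not use AleaGPU at all. It simulates pitched storage with plain arrays: `PitchDemo`, `ToPitched` and the `padding` parameter are hypothetical names made up for this illustration, and the padding of 3 elements is an arbitrary assumption (on a real device the pitch is chosen by the allocator and should be read from the buffer). Each buffer is indexed with its own pitch rather than with its logical width.

```csharp
using System;

public static class PitchDemo
{
    // Copies a dense matrix into a flat buffer whose rows start `pitch` elements apart.
    // The extra (pitch - cols) slots at the end of each row are unused padding.
    public static double[] ToPitched(double[,] m, int pitch)
    {
        int rows = m.GetLength(0), cols = m.GetLength(1);
        if (pitch < cols) throw new ArgumentException("pitch must be >= column count");
        double[] buffer = new double[rows * pitch];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                buffer[i * pitch + j] = m[i, j];
        return buffer;
    }

    // Multiplies m1 by m2 using simulated pitched buffers.
    // `padding` is an assumed number of extra elements per row.
    public static double[,] Multiply(double[,] m1, double[,] m2, int padding)
    {
        int h = m1.GetLength(0), l = m1.GetLength(1), w = m2.GetLength(1);
        if (l != m2.GetLength(0)) throw new ArgumentException("Invalid matrix sizes");

        // Every buffer has its own pitch, because the row widths differ.
        int pitch1 = l + padding, pitch2 = w + padding, pitchR = w + padding;
        double[] p1 = ToPitched(m1, pitch1), p2 = ToPitched(m2, pitch2);
        double[] pr = new double[h * pitchR];

        for (int index = 0; index < h * w; index++)
        {
            // The logical indexes still come from the logical width w...
            int i = index / w, j = index % w;
            double sum = 0;
            for (int k = 0; k < l; k++)
            {
                // ...but the memory offsets use the pitch of each buffer.
                sum += p1[i * pitch1 + k] * p2[k * pitch2 + j];
            }
            pr[i * pitchR + j] = sum;
        }

        // Copy back into a dense array, dropping the padding.
        double[,] result = new double[h, w];
        for (int i = 0; i < h; i++)
            for (int j = 0; j < w; j++)
                result[i, j] = pr[i * pitchR + j];
        return result;
    }

    public static void Main()
    {
        double[,] a = { { 1, 2 }, { 3, 4 } };
        double[,] b = { { 5, 6 }, { 7, 8 } };
        double[,] c = Multiply(a, b, padding: 3);
        Console.WriteLine($"{c[0, 0]} {c[0, 1]} / {c[1, 0]} {c[1, 1]}");
    }
}
```

In the kernel from the question, this would presumably mean replacing `i * l`, `k * w` and `i * w` with offsets based on the pitch of `m1_device`, `m2_device` and `mresult_device` respectively, each read through `PitchInElements` as in the sample above.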

.net

c#

wpf

visual-studio

aleagpu