我正在尝试使用 Chapel 提高我的运行次矩阵乘法

Question

我正在努力提高我的矩阵乘法速度。有没有我可以做的其他实现来加快它的速度到目前为止，这是我的结果，我尝试执行 8192，但花了 2 个多小时，我的 ssh 连接超时。

这是我的实现：

use Random, Time;
var t : Timer;
t.start();

config const size = 10;
var grid : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid);
fillRandom(grid2);

//t.start();
forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear();

我正在将时间与 C++ 中的 MPI 实现进行比较。我想知道是否有一种方法可以将我的矩阵分发到类似于 MPI 的两个语言环境？

Answer 1

这种 forall 循环的嵌套在我们当前的实现中没有提供最佳性能。如果您遍历定义 (i,j) 的迭代 space 的单个二维域，您的算法将执行得更快。在 k 上进行串行循环将避免更新 grid3[i,j] 时的数据竞争。例如：

....
const D2 = {1..size, 1..size};
forall (i,j) in D2 do
  for k in 1..size do
    grid3[i,j] += grid[i,k] * grid2[k,j];

要分发您的矩阵，您可以使用 Block 分发 - 请参阅我们的 online docs 中的示例。分发时，您当然需要注意语言环境之间的额外通信。

测试性能时，一定要用--fast编译。

Answer 2

_{Ex-post remark: see the benchmarked Vass' proposed performance edge on the hosted eco-system, would be great to reproduce this on multi-locale silicon so as to prove it universal for larger scales...}

性能？
基准！_{... ，没有例外，没有借口 - 惊喜并不少见 -ccflags -O3}

这就是 chapel 如此出色的原因。非常感谢 Chapel 团队在过去十年中为 HPC 开发和改进了如此出色的计算工具。

在真正的爱中-[PARALLEL] 努力，性能始终是设计实践和底层系统硬件 的结果，而不仅仅是语法-构造函数授予 "bonus".

使用 TiO.run online IDE for demo-test，单个语言环境硅的结果如下：

TiO.run platform uses   1 numLocales,
               having   2 physical CPU-cores accessible (numPU-s)
                 with   2 maxTaskPar parallelism limit

For grid{1,2,3}[ 128, 128] the tested forall sum-product took       3.124 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the tested forall sum-product took       2.183 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the Vass-proposed sum-product took       1.925 [us] excl. fillRandom()-ops

For grid{1,2,3}[ 256, 256] the tested forall sum-product took      28.593 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the tested forall sum-product took      25.254 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the Vass-proposed sum-product took      21.493 [us] excl. fillRandom()-ops

For grid{1,2,3}[1024,1024] the tested forall sum-product took   2.658.560 [us] incl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the tested forall sum-product took   2.604.783 [us] excl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the Vass-proposed sum-product took   2.103.592 [us] excl. fillRandom()-ops

For grid{1,2,3}[2048,2048] the tested forall sum-product took  27.137.060 [us] incl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the tested forall sum-product took  26.945.871 [us] excl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the Vass-proposed sum-product took  25.351.754 [us] excl. fillRandom()-ops

For grid{1,2,3}[2176,2176] the tested forall sum-product took  45.561.399 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  45.375.282 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  41.304.391 [us] excl. fillRandom()-ops

--fast --ccflags -O3

For grid{1,2,3}[2176,2176] the tested forall sum-product took  39.680.133 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  39.494.035 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  44.611.009 [us] excl. fillRandom()-ops

结果：

在单一区域设置（单个虚拟化处理器，2 个内核，运行所有 public 时间受限（< 60 [s]）分时 public 工作负载 online IDE ) 设备仍然是次优的 ( Vass 的方法进一步关于 20% 更快 ) 网格尺度的产量 2048x2048 关于 ~ 4.17 x 比 Nvidia Jetson 托管计算更快 结果。随着 memory-I/O 范围的增长和 CPU-L1/L2/L3-cache 预取的范围（应该遵循 ~ O(n^3) 缩放）进一步提高基于 CPU 的 NUMA 平台的性能优势。

很高兴看到在多区域设置设备和本机 Cray NUMA 集群平台上执行可实现的性能和 ~ O(n^3) 缩放服从。

use Random, Time, IO.FormattedIO;
var t1 : Timer;
var t2 : Timer;

t1.start(); //----------------------------------------------------------------[1]

config const size = 2048;

var grid1 : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid1);
fillRandom(grid2);

t2.start(); //====================================[2]

forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid1[i,k] * grid2[k,j];
        }
    }
}

t2.stop(); //=====================================[2]
t1.stop(); //-----------------------------------------------------------------[1]
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] incl. fillRandom()-ops\n",
                      size,
                      size,
                      t1.elapsed( TimeUnits.microseconds )
                      );
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
///////////////////////////////////////////////////////////////////////////////////
t1.clear();
t2.clear();

const D3 = {1..size, 1..size, 1..size};

t2.start(); //====================================[3]

forall (i,j,k) in D3 do
  grid3[i,j] += grid1[i,k] * grid2[k,j];

t2.stop(); //=====================================[3]
             writef( "For grid{1,2,3}[%4i,%4i] the Vass-proposed sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\
//       TiO.run platform uses   1 numLocales, having   2 physical CPU-cores accessible (numPU-s) with   2 maxTaskPar parallelism limit
writef( "TiO.run platform uses %3i numLocales, having %3i physical CPU-cores accessible (numPU-s) with %3i maxTaskPar parallelism limit\n",
                      numLocales,
                      here.numPUs( false, true ),
                      here.maxTaskPar
                      );

Answer 3

在 Chapel 中，forall 循环不会自动跨不同的区域设置工作或数据（想想：计算节点或内存）。相反，forall 循环调用与您正在迭代的对象相关联的并行迭代器。

因此，如果您要遍历单个区域设置的局部内容，例如范围（如代码对 1..size 的使用）或非分布式域或数组（如 grid 在你的代码中），所有用于实现并行循环的任务都将在原始语言环境中本地执行。相反，如果您在分布式域或数组上进行迭代（例如 Block-distributed), or invoking a distributed iterator (e.g., one from the DistributedIters 模块），任务将分布在 iterand 目标的所有区域设置中。

因此，任何不引用其他语言环境的 Chapel 程序——无论是显式地通过 on-clauses 还是隐式地通过包装 on-clauses 的抽象，如上面提到的分布式数组和迭代器——将永远不会使用其他资源比初始语言环境的。

我还想提供有关分布式算法的旁注：即使您要更新上面的程序以跨多个区域设置分布网格数组和 forall 循环，三重嵌套循环方法也很少是最佳矩阵乘法分布式内存系统上的算法，因为它不能很好地针对局部性进行优化。更好的方法可能是研究为分布式内存设计的矩阵乘法算法（例如，SUMMA）。

我正在尝试使用 Chapel 提高我的 运行 次矩阵乘法

I am trying to improve my run times with Chapel for matrix multiplication

chapel

性能？ 基准！ ... ，没有例外，没有借口 - 惊喜并不少见 -ccflags -O3

结果：

我正在尝试使用 Chapel 提高我的运行次矩阵乘法

性能？
基准！_{... ，没有例外，没有借口 - 惊喜并不少见 -ccflags -O3}