I am trying to improve my run times with Chapel for matrix multiplication
I am trying to improve my matrix multiplication speed. Is there another implementation I could use to speed it up?
These are my results so far; I tried running size 8192, but it took over 2 hours and my ssh connection timed out.
Here is my implementation:
use Random, Time;

var t : Timer;
t.start();

config const size = 10;
var grid  : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid);
fillRandom(grid2);
//t.start();

forall i in 1..size {
  forall j in 1..size {
    forall k in 1..size {
      grid3[i,j] += grid[i,k] * grid2[k,j];
    }
  }
}

t.stop();
writeln("Done!:");
writeln(t.elapsed(), "seconds");
writeln("Size of matrix was:", size);
t.clear();
I am comparing the times against an MPI implementation in C++. I am wondering whether there is a way to distribute my matrices across two locales, similar to what I do with MPI?
This nesting of forall loops does not give the best performance in our current implementation. Your algorithm will run faster if you iterate over a single 2-D domain defining the (i,j) iteration space. A serial loop over k will then avoid a data race when updating grid3[i,j]. For example:
const D2 = {1..size, 1..size};
forall (i,j) in D2 do
  for k in 1..size do
    grid3[i,j] += grid[i,k] * grid2[k,j];
To distribute your matrices, you can use the Block distribution: see the examples in our online docs. When distributing, you will of course need to be mindful of the additional communication between the locales.
When testing performance, be sure to compile with --fast.
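For concreteness, here is a minimal sketch of what block-distributing the matrices could look like. This is not code from the docs: it assumes a Chapel version where dmapped Block(boundingBox=...) is the Block-distribution syntax (newer releases spell it blockDist.createDomain()), and it keeps the serial-k loop from above:

use BlockDist;

config const size = 10;

const Space = {1..size, 1..size};
// Block-distribute the (i,j) index space, and the arrays declared over it,
// across all of the locales the program was launched on.
const D2 = Space dmapped Block(boundingBox=Space);

var grid, grid2, grid3 : [D2] real;

forall (i,j) in D2 do   // each task runs on the locale owning grid3[i,j]
  for k in 1..size do   // the serial k loop avoids the data race on grid3[i,j]
    grid3[i,j] += grid[i,k] * grid2[k,j];

Compile with chpl --fast and launch on two locales with ./a.out -nl 2. Any reads of grid[i,k] or grid2[k,j] that land on the other locale will incur the extra communication mentioned above.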
Ex-post remark: see the benchmarked performance edge of Vass' proposal on the hosted eco-system below; it would be great to reproduce this on multi-locale silicon so as to prove it universal at larger scales...
Performance?
Benchmark! ... always, no exceptions, no excuses - surprises are not uncommon here ( -ccflags -O3 )
This is why Chapel is so great. Many thanks to the Chapel team for having developed and improved such a great computing tool for HPC over the past decade.
With true love for in-[PARALLEL] efforts, performance is always a result of the design practices and of the underlying system hardware, never a syntax-constructor-granted "bonus".
Using the TiO.run online IDE for a demo-test, the results on single-locale silicon were as follows:
TiO.run platform uses 1 numLocales, having 2 physical CPU-cores accessible (numPU-s) with 2 maxTaskPar parallelism limit
For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3.124 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2.183 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the Vass-proposed sum-product took 1.925 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the tested forall sum-product took 28.593 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the tested forall sum-product took 25.254 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the Vass-proposed sum-product took 21.493 [us] excl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the tested forall sum-product took 2.658.560 [us] incl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the tested forall sum-product took 2.604.783 [us] excl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the Vass-proposed sum-product took 2.103.592 [us] excl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the tested forall sum-product took 27.137.060 [us] incl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the tested forall sum-product took 26.945.871 [us] excl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the Vass-proposed sum-product took 25.351.754 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took 45.561.399 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took 45.375.282 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took 41.304.391 [us] excl. fillRandom()-ops
Recompiled with --fast --ccflags -O3:
For grid{1,2,3}[2176,2176] the tested forall sum-product took 39.680.133 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took 39.494.035 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took 44.611.009 [us] excl. fillRandom()-ops
Results:
On a single-locale device (a single virtualized processor, 2 cores, running all of the online IDE's public, time-constrained (< 60 [s]), time-shared workloads), the performance is still sub-optimal (Vass' approach being about 20% faster). At the 2048x2048 grid scale it yields results about ~ 4.17x faster than the Nvidia Jetson-hosted computation. This performance edge should grow further on CPU-based NUMA platforms as the memory-I/O ranges and the reach of CPU-L1/L2/L3-cache pre-fetching grow (the work itself ought to follow ~ O(n^3) scaling).
It would be great to see the achievable performance and the ~ O(n^3) scaling obedience as executed on multi-locale devices and on native Cray NUMA cluster platforms.
use Random, Time, IO.FormattedIO;

var t1 : Timer;
var t2 : Timer;

t1.start(); //----------------------------------------------------------------[1]

config const size = 2048;
var grid1 : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid1);
fillRandom(grid2);

t2.start(); //====================================[2]
forall i in 1..size {
  forall j in 1..size {
    forall k in 1..size {
      grid3[i,j] += grid1[i,k] * grid2[k,j];
    }
  }
}
t2.stop(); //=====================================[2]

t1.stop(); //-----------------------------------------------------------------[1]

writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] incl. fillRandom()-ops\n",
        size,
        size,
        t1.elapsed( TimeUnits.microseconds )
        );
writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] excl. fillRandom()-ops\n",
        size,
        size,
        t2.elapsed( TimeUnits.microseconds )
        );

///////////////////////////////////////////////////////////////////////////////////

t1.clear();
t2.clear();

const D3 = {1..size, 1..size, 1..size};

t2.start(); //====================================[3]
forall (i,j,k) in D3 do
  grid3[i,j] += grid1[i,k] * grid2[k,j];
t2.stop(); //=====================================[3]

writef( "For grid{1,2,3}[%4i,%4i] the Vass-proposed sum-product took %12i [us] excl. fillRandom()-ops\n",
        size,
        size,
        t2.elapsed( TimeUnits.microseconds )
        );

//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\//\
// TiO.run platform uses 1 numLocales, having 2 physical CPU-cores accessible (numPU-s) with 2 maxTaskPar parallelism limit

writef( "TiO.run platform uses %3i numLocales, having %3i physical CPU-cores accessible (numPU-s) with %3i maxTaskPar parallelism limit\n",
        numLocales,
        here.numPUs( false, true ),
        here.maxTaskPar
        );
In Chapel, forall loops do not automatically distribute work or data across distinct locales (think: compute nodes and their memories). Instead, a forall loop invokes the parallel iterator associated with the thing being iterated over.
So if you iterate over something local to a single locale, such as a range (as in your code's use of 1..size) or a non-distributed domain or array (like grid in your code), all of the tasks used to implement the parallel loop will execute locally on the original locale. In contrast, if you iterate over a distributed domain or array (e.g., one that is Block-distributed), or invoke a distributed iterator (e.g., one from the DistributedIters module), the tasks will be distributed across all of the locales that the iterand targets.
As a result, any Chapel program that never refers to other locales, whether explicitly via on-clauses or implicitly via abstractions that wrap on-clauses (like the distributed arrays and iterators mentioned above), will never use any resources other than the initial locale's.
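To make that concrete, here is a minimal sketch (names assumed, not from the original answer) that prints where the tasks of each forall actually run. On a multi-locale launch (e.g., ./a.out -nl 2), the first loop reports only locale 0, while the second reports each element's owning locale:

use BlockDist;

config const n = 8;

// Iterating a local range: the range's parallel iterator creates all of
// its tasks on the current (initial) locale, so here.id is always 0.
forall i in 1..n do
  writef("%i ", here.id);
writeln();

// Iterating a Block-distributed array: the tasks follow the data, so
// here.id reports the locale that owns each element of A.
const D = {1..n} dmapped Block(boundingBox={1..n});
var A : [D] real;
forall a in A do
  writef("%i ", here.id);
writeln();

(The per-task writef calls may interleave; this is only meant to illustrate task placement.)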
As a side note on distributed algorithms: even if you updated the program above to distribute the grid arrays and the forall loops across multiple locales, the triply nested loop approach is rarely an optimal matrix multiplication algorithm on distributed-memory systems, since it does not optimize for locality very well. A better bet may be to look into matrix multiplication algorithms designed for distributed memory (e.g., SUMMA).
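For intuition on the locality point (this is not the answer's code, and it is far from SUMMA itself): even on a single locale, tiling the triple loop so that small blocks of the operands stay cache-resident typically beats the naive version, and distributed algorithms such as SUMMA apply the same blocking idea across locales. A minimal single-locale tiling sketch, with an assumed tile size:

config const size = 2048,
             tile = 64;          // assumed tile size; tune for the target cache

var A, B, C : [1..size, 1..size] real;

// Blocked matrix multiply: each task owns distinct (i,j) tiles, and the
// serial kt loop avoids any race on C[i,j]; the i/k/j loop order keeps
// the innermost accesses to B and C stride-1.
forall (it, jt) in {1..size by tile, 1..size by tile} do
  for kt in 1..size by tile do
    for i in it..min(it+tile-1, size) do
      for k in kt..min(kt+tile-1, size) {
        const a = A[i,k];
        for j in jt..min(jt+tile-1, size) do
          C[i,j] += a * B[k,j];
      }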