Efficient treatment of tuples as fixed-size vectors
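In Chapel, homogeneous tuples can be used as if they were small "vectors" (e.g., a = b + c * 3.0 + 5.0;).
However, because various math functions are not provided for tuples, I tried writing a function for norm() in several ways and compared their performance. My code is something like this: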
proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple( a ) );
writeln( norm_loop( a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce( a ) );

config const nloops = 100000000; // 1E+8

var res = 0.0;

for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple( a );
    // res += norm_loop( a );
    // res += norm_loop_param( a );
    // res += norm_reduce( a );
}

writeln( "result = ", res );
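I compiled the code above with chpl --fast test.chpl (Chapel v1.16 on OSX 10.11 with 4 cores, installed via homebrew). Then norm_3tuple(), norm_loop(), and norm_loop_param() gave almost the same speed (0.45 sec), while norm_reduce() was much slower (about 30 sec). I checked the output of the top command, and norm_reduce() was using all 4 cores, while the other functions used only 1 core. So my questions are:
- Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?
- Given that we want to avoid reduce for 3-tuples, the other three routines run essentially with the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by the --fast option)?
- In norm_loop_param(), I have also tried using the param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?
I'm sorry for asking many questions at once, and I would appreciate any advice/suggestions for efficient treatment of small tuples. Thanks very much!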
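Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?
I believe you are correct that this is what's happening. Reductions are executed in parallel, and Chapel doesn't currently attempt any intelligent throttling to squash this parallelism when the work may not warrant it (as in this case), so I think you're suffering too much task overhead for doing almost no work other than coordinating with the other tasks (though I'm surprised that the magnitude of the difference is so great... but I also find I have little intuition for such things). In the future, we'd hope that the compiler would serialize such small reductions in order to avoid these overheads.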
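Until the compiler serializes small reductions on its own, the sum can simply be kept serial by hand. The following is a minimal sketch for illustration only; the name norm_serial is hypothetical and does not appear in the original post. It iterates over the tuple elements directly instead of indexing them, which also keeps it independent of the tuple index base (1-based in the Chapel 1.16 used in the question, 0-based in later releases, from 1.22 on as far as I know).

```chapel
// Sketch only: serial sum of squares over a homogeneous tuple.
// 'norm_serial' is a hypothetical name, not part of the original post.
proc norm_serial(x): real {
  var tmp = 0.0;
  for xi in x do      // a plain 'for' loop is serial: no tasks are created
    tmp += xi**2;
  return sqrt(tmp);
}

var a = (1.0, 2.0, 3.0);
writeln(norm_serial(a));   // same value as the other norm_*() variants: sqrt(14)
```

This is essentially norm_loop() from the question without the explicit index range, so it should behave the same way performance-wise; that expectation is an assumption and is worth measuring.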
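Given that we want to avoid reduce for 3-tuples, the other three routines run essentially with the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by --fast option)?
The Chapel compiler doesn't unroll the explicit for loop in norm_loop() (you can verify this by inspecting the code generated with the --savec flag), but it could be that the back-end compiler does. Or it could be that the for loop really doesn't cost that much compared to the unrolled loop of norm_loop_param(). I suspect you'd need to inspect the generated assembly to determine which is the case. But I'd also expect back-end C compilers to do a decent job with our generated code; for example, it's easy for them to see that this is a 3-iteration loop.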
In norm_loop_param(), I have also tried using param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?
This is hard to give a definitive answer to, since I think it's mostly a question of how good the back-end C compiler is.
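Performance aside, one case where param is required rather than optional is a heterogeneous tuple, whose elements can only be indexed with a compile-time constant. The sketch below is illustrative only: sumSquares is a hypothetical name, and the code assumes a later Chapel release with 0-based tuple indexing (under the 1.16 used in the question, the range would be 1..t.size instead).

```chapel
// Sketch only: 'sumSquares' is a hypothetical helper, not from the original post.
// Assumes 0-based tuple indexing (later Chapel releases); use 1..t.size on 1.16.
proc sumSquares(t): real {
  var tmp = 0.0;
  // The param loop is expanded at compile time, so each t(i) is resolved
  // with a known index and a known element type.
  for param i in 0..t.size-1 do
    tmp += (t(i): real)**2;
  return tmp;
}

writeln(sumSquares((1, 2.0, 3)));   // mixed int/real tuple: 1 + 4 + 9
```

For the homogeneous 3*real tuples in the question, a runtime index works as well, which is why norm_loop() compiles at all.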
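Ex-post remark: actually, there is a third performance surprise at the end...
Performance?
Benchmark! ...always, no exceptions, no excuses.
This is what makes Chapel so great. Many thanks to the Chapel team for developing and improving such a great computing tool for HPC over the last decade.
With full love for true-[PARALLEL] efforts, performance is always a result of both the design practices and the underlying system hardware, never just a "bonus" granted by a syntax constructor.
The norm_reduce() processing spends a few milliseconds just to set up all the concurrency-enabled reduce computing facilities, only to later produce and return a single x**2 product into a results queue for a deferred summation by a central +-reducer engine. Quite a lot of overhead for a single 2 CLK CPU micro-instruction, isn't it?
For the reasons why, please review the costs of process-scheduling details and my updated criticism of Amdahl's Law original formulation.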
Benchmarking the code actually delivers two surprises at once:
+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.RUN
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5677.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5818.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 4886.0 [us] -- 3.74166
The first was reported in the initial post; the second was observed after the Chapel run was equipped with the --fast compiler switch:
+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.+CompilerFLAG( "--fast" ).RUN
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 7769.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 9109.0 [us] -- 3.74166
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 8807.0 [us] -- 3.74166
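As always, SuperComputing2017 HPC promotes [Reproducibility] for every aspect published in a technical paper or benchmark test.
These results were collected on the Chapel online platform sponsored by Try-it-Online. All interested enthusiasts are welcome to re-run the code and post the performance details from their own localhost or cluster, so as to better document the hardware-related variability of the timings observed above (for further experimentation with the ready-to-run, timing-decorated code, one may use this link to a full-state snapshot of the TiO.IDE).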
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;
proc norm_3tuple( x: 3*real ): real
{
return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}
proc norm_loop( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
var tmp = 0.0;
for i in 1 .. x.size do
tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write( "[SEQ] norm_loop(): ",
aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
return sqrt( tmp );
}
proc norm_loop_param( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
var tmp = 0.0;
for param i in 1 .. x.size do
tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write( "[SEQ] norm_loop_param(): ",
aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
return sqrt( tmp );
}
proc norm_reduce( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
var tmp = ( + reduce x**2 );
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop(); write( "[PAR]: norm_reduce(): ",
aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
return sqrt( tmp );
}
//.........................................................
var a = ( 1.0, 2.0, 3.0 );
// consistency check
writeln( norm_3tuple( a ) );
writeln( norm_loop( a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce( a ) );
Scaling:
[LOOP] norm_3tuple(): 45829.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_3tuple(): 241680 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_3tuple(): 2387080 [us] -- result = 4.30918e+08 @ 100000000 loops.
[LOOP] norm_loop(): 72160.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop(): 755959 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop(): 7783740 [us] -- result = 4.30918e+08 @ 100000000 loops.
[LOOP] norm_loop_param(): 34102.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop_param(): 365510 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop_param(): 3480310 [us] -- result = 4.30918e+08 @ 100000000 loops.
-------------------------------------------------------------------------1000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 5851380 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5884600 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6163690 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6029860 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6083730 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6132720 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6012620 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6379020 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5923550 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6144660 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 8098380 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6215470 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5831670 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6124580 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6092740 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5811260 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5880400 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5898520 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6591110 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5876570 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6034180 [us] -- result = 4309.18 @ 1000 loops. [--fast]
-------------------------------------------------------------------------2000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 12434700 [us] -- result = 8618.36 @ 2000 loops.
-------------------------------------------------------------------------3000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 17807600 [us] -- result = 12927.5 @ 3000 loops.
-------------------------------------------------------------------------4000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 23844300 [us] -- result = 17236.7 @ 4000 loops.
-------------------------------------------------------------------------5000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 30557700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 30523700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29404200 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29268600 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 29009500 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 30388800 [us] -- result = 21545.9 @ 5000 loops. [--fast]
-------------------------------------------------------------------------6000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 37070600 [us] -- result = 25855.1 @ 6000 loops.
-------------------------------------------------------------------------7000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 42789200 [us] -- result = 30164.3 @ 7000 loops.
---------------------------------------------------------------------8000--------{--fast}---------------------------------------------------------------------
[LOOP] norm_reduce(): 50572700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49944300 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49365600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): ~60+ // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 50099900 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49445500 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49783800 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48533400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48966600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47564700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47087400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47624300 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 46887700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46571800 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46794700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46862600 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 47348700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46669500 [us] -- result = 34473.4 @ 8000 loops. [--fast]
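To make the scaling listing easier to compare, the totals can be turned into a per-call cost. The sketch below only re-uses numbers copied from the listing above; the helper name perCallNs is hypothetical. On this platform it works out to roughly 24 ns per call for norm_3tuple() against roughly 5.9 ms per call for norm_reduce(), and the norm_reduce() figure stays near 6 ms from 1000 to 8000 loops, i.e. the cost grows linearly with the number of calls.

```chapel
// Sketch only: per-call cost derived from the totals in the listing above.
// 'perCallNs' is a hypothetical helper name.
proc perCallNs(elapsedUs: real, nCalls: real): real {
  return elapsedUs / nCalls * 1000.0;   // [us] per call -> [ns] per call
}

// (elapsed [us], number of loops) pairs copied from the listing
writeln("norm_3tuple():     ", perCallNs( 2387080.0, 1.0e8), " [ns] per call");
writeln("norm_loop():       ", perCallNs( 7783740.0, 1.0e8), " [ns] per call");
writeln("norm_loop_param(): ", perCallNs( 3480310.0, 1.0e8), " [ns] per call");
writeln("norm_reduce():     ", perCallNs( 5851380.0, 1.0e3), " [ns] per call");
writeln("norm_reduce():     ", perCallNs(50572700.0, 8.0e3), " [ns] per call");
```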
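A third surprise appeared, from going into a forall do { ... }:
While the [SEQ] nloops-ed code was heavily devastated by the associated add-on overheads, a small re-formulation of the problem has shown that quite different levels of performance are achievable even on a single-CPU platform (the gains ought to be greater still for multi-CPU code execution), as has the effect that the --fast compiler switch produced here: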
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_LOOP: Timer;
config const nloops = 100000000; // 1E+8
var res: atomic real;
res.write( 0.0 );
//------------------------------------------------------------------// PRE-COMPUTE:
var A1: [1 .. nloops] real; // pre-compute a tuple-element value
forall k in 1 .. nloops do // pre-compute a tuple-element value
A1[k] = (k % 5): real; // pre-compute a tuple-element value to a ( k % 5 ), ex-post typecast to real
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.start();
forall i in 1 .. nloops do
{ // a[1] = ( i % 5 ): real; // pre-compute'd
res.add( norm_reduce( ( A1[i], a[1], a[2] ) ) ); // atomic.add()
// res += norm_reduce( ( ( i % 5 ): real, a[1], a[2] ) ); // non-atomic
//:49: note: The shadow variable 'res' is constant due to forall intents in this loop
}/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.stop(); write(
"forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: ", aStopWATCH_LOOP.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
/*
--------------------------------------------------------------------------------------------------------{-nloops-}-------{--fast}-------------
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 7911.0 [us] -- result = 320.196 @ 100 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8055.0 [us] -- result = 3201.96 @ 1000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8002.0 [us] -- result = 32019.6 @ 10000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 80685.0 [us] -- result = 3.20196e+05 @ 100000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 842948 [us] -- result = 3.20196e+06 @ 1000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8005300 [us] -- result = 3.20196e+07 @ 10000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40358900 [us] -- result = 1.60098e+08 @ 50000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40671200 [us] -- result = 1.60098e+08 @ 50000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 2195000 [us] -- result = 1.60098e+08 @ 50000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4518790 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 6178440 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4755940 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4405480 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4509170 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4736110 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4653610 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4397990 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4655240 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
*/