Efficient treatment of tuples as fixed-size vectors

Question

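In Chapel, homogeneous tuples can be used as if they were small "vectors" (e.g., a = b + c * 3.0 + 5.0;).

However, because various math functions are not provided for tuples, I tried writing a function for norm() in several ways and compared their performance. My code is something like this: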

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

config const nloops = 100000000;  // 1E+8

var res = 0.0;
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple(     a );
 // res += norm_loop(       a );
 // res += norm_loop_param( a );
 // res += norm_reduce(     a );
}

writeln( "result = ", res );

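I compiled the above code with chpl --fast test.chpl (Chapel v1.16 on OSX10.11 with 4 cores, installed via homebrew). Then norm_3tuple(), norm_loop(), and norm_loop_param() ran at almost the same speed (0.45 sec), while norm_reduce() was much slower (about 30 sec). I checked the output of the top command: norm_reduce() was using all 4 cores, while the other functions used only 1 core. So my questions are...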

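Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?
Given that we want to avoid reduce for 3-tuples, the other three routines run essentially with the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by --fast option)?
In norm_loop_param(), I have also tried using param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?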

I'm sorry for asking many questions at once. I would appreciate any advice/suggestions for treating small tuples efficiently. Thanks very much!

Answer 1

Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?

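I believe you are correct that this is what's happening. Reductions are executed in parallel, and Chapel doesn't currently attempt any intelligent throttling to squash this parallelism when the work may not warrant it (as in this case). So I think you're paying a lot of task overhead to do almost no work other than coordinating with the other tasks (though I'm surprised the magnitude of the difference is so large... but I also find I have little intuition for such things). In the future, we'd hope that the compiler would serialize such small reductions in order to avoid these overheads.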
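As a minimal sketch of what serializing such a small reduction by hand might look like, the variant below wraps the reduction in Chapel's serial statement, which asks that no new tasks be created for the enclosed parallel construct. The name norm_reduce_serial is hypothetical (it does not appear in the question), tuple indexing and syntax follow the question's Chapel 1.16 code, and whether this actually recovers the speed of the loop versions on a given release is an assumption that would need to be benchmarked.

```chapel
// Hypothetical variant of norm_reduce(): the same reduction, but wrapped in
// a `serial` statement so that it is evaluated without spawning extra tasks.
proc norm_reduce_serial( x ): real
{
    var tmp = 0.0;
    serial true do
        tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

var a = ( 1.0, 2.0, 3.0 );

// consistency check against the other norm_*() routines
writeln( norm_reduce_serial( a ) );
```

If the serialized form still turns out slower than the explicit loops, the loop versions from the question remain the safer choice for 3-tuples.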

Given that we want to avoid reduce for 3-tuples, the other three routines run essentially with the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by --fast option)?

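The Chapel compiler doesn't unroll the explicit for loop in norm_loop() (you can verify this by inspecting the code generated with the --savec flag), but it could be that the back-end compiler does. Or it could be that the for loop really doesn't cost that much compared to the unrolled loop of norm_loop_param(). I suspect you'd need to inspect the generated assembly to determine which is the case. But I would also expect back-end C compilers to do decently with our generated code; for example, it's easy for them to see that this is a 3-iteration loop.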

In norm_loop_param(), I have also tried using param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?

It's hard to give a definitive answer to this, since I think it's mostly a question of how good the back-end C compiler is.
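Since the answer depends on the back-end compiler, the practical route is to measure it on the machine in question. The following is only a sketch of such a measurement, assuming the Timer record from the Time module as shipped with the question's Chapel 1.16 and its 1-based tuple indexing (I believe newer releases index tuples from 0 and have renamed the timer type, so adjust accordingly). The default nloops value is an arbitrary choice for illustration.

```chapel
use Time;

config const nloops = 10000000;   // assumed default; override with --nloops=<n>

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

var a = ( 1.0, 2.0, 3.0 );
var t: Timer;
var res = 0.0;

// Time the plain for-loop version over the whole outer loop.
t.start();
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;
    res += norm_loop( a );
}
t.stop();
writeln( "norm_loop:       ", t.elapsed(), " s, result = ", res );

// Reset the timer and the accumulator before timing the param version.
t.clear();
res = 0.0;

t.start();
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;
    res += norm_loop_param( a );
}
t.stop();
writeln( "norm_loop_param: ", t.elapsed(), " s, result = ", res );
```

Timing the whole outer loop, rather than a single call, keeps the timer's own cost out of the measurement. Compiling once with and once without --fast shows how much of any difference comes from back-end optimization.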

Answer 2

Ex-post remark: there is actually a third performance surprise at the very end...

Performance?
Benchmark! ...always, no exceptions, no excuse

This is what makes Chapel so great. Many thanks to the Chapel team for developing and improving such a great computing tool for HPC over the past decade.

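With all due love for true-[PARALLEL] efforts, performance is always a result of both design practices and the underlying system hardware, never just a "bonus" granted by a syntax constructor.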

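norm_reduce() spends a few milliseconds of system processing just to set up all the concurrency-capable reduce computing facilities, only to have each of them later produce and return a single x**2 product into the results queue of the deferred central +-reducer engine for summation. Quite a lot of overhead for a single 2-CLK CPU micro-instruction, isn't it?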

For the reasons why, please review the costs of process-scheduling details and my updated criticism of Amdahl's Law original formulation.

Benchmarking the code actually delivered two surprises at once:

+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.RUN
3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5677.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 5818.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 4886.0 [us] -- 3.74166

The first was reported in the original post; the second was observed after the Chapel run was equipped with the --fast compiler switch:

+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.+CompilerFLAG( "--fast" ).RUN
3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 2.0 [us] -- 3.74166
[PAR]: norm_reduce(): 7769.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 0.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 0.0 [us] -- 3.74166
[PAR]: norm_reduce(): 9109.0 [us] -- 3.74166

3.74166
[SEQ] norm_loop(): 1.0 [us] -- 3.74166
[SEQ] norm_loop_param(): 1.0 [us] -- 3.74166
[PAR]: norm_reduce(): 8807.0 [us] -- 3.74166

As always, [Reproducibility] is promoted by SuperComputing2017 HPC for every aspect published in technical papers or benchmarks.

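These results were collected on the Chapel online platform sponsored by Try-it-Online. All interested enthusiasts are welcome to re-run the Chapel code and post details of its performance on their localhost or cluster, so as to better document the hardware-related variability of the timings observed above (for further experimentation with the prepared, timing-decorated code, one may use this link to a full-state snapshot of the TiO.IDE).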

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop();
    write( "[SEQ] norm_loop(): ", aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop();
    write( "[SEQ] norm_loop_param(): ", aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
    var tmp = ( + reduce x**2 );
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop();
    write( "[PAR]: norm_reduce(): ", aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

Scaling:

[LOOP] norm_3tuple(): 45829.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_3tuple(): 241680 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_3tuple(): 2387080 [us] -- result = 4.30918e+08 @ 100000000 loops.

[LOOP] norm_loop(): 72160.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop(): 755959 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop(): 7783740 [us] -- result = 4.30918e+08 @ 100000000 loops.

[LOOP] norm_loop_param(): 34102.0 [us] -- result = 4.30918e+06 @ 1000000 loops.
[LOOP] norm_loop_param(): 365510 [us] -- result = 4.30918e+07 @ 10000000 loops.
[LOOP] norm_loop_param(): 3480310 [us] -- result = 4.30918e+08 @ 100000000 loops.

--------------------1000--------{--fast}--------------------
[LOOP] norm_reduce(): 5851380 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5884600 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6163690 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6029860 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6083730 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6132720 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6012620 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6379020 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 5923550 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 6144660 [us] -- result = 4309.18 @ 1000 loops.
[LOOP] norm_reduce(): 8098380 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6215470 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5831670 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6124580 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6092740 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5811260 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5880400 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5898520 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6591110 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 5876570 [us] -- result = 4309.18 @ 1000 loops. [--fast]
[LOOP] norm_reduce(): 6034180 [us] -- result = 4309.18 @ 1000 loops. [--fast]
--------------------2000--------{--fast}--------------------
[LOOP] norm_reduce(): 12434700 [us] -- result = 8618.36 @ 2000 loops.
--------------------3000--------{--fast}--------------------
[LOOP] norm_reduce(): 17807600 [us] -- result = 12927.5 @ 3000 loops.
--------------------4000--------{--fast}--------------------
[LOOP] norm_reduce(): 23844300 [us] -- result = 17236.7 @ 4000 loops.
--------------------5000--------{--fast}--------------------
[LOOP] norm_reduce(): 30557700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 30523700 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29404200 [us] -- result = 21545.9 @ 5000 loops.
[LOOP] norm_reduce(): 29268600 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 29009500 [us] -- result = 21545.9 @ 5000 loops. [--fast]
[LOOP] norm_reduce(): 30388800 [us] -- result = 21545.9 @ 5000 loops. [--fast]
--------------------6000--------{--fast}--------------------
[LOOP] norm_reduce(): 37070600 [us] -- result = 25855.1 @ 6000 loops.
--------------------7000--------{--fast}--------------------
[LOOP] norm_reduce(): 42789200 [us] -- result = 30164.3 @ 7000 loops.
--------------------8000--------{--fast}--------------------
[LOOP] norm_reduce(): 50572700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49944300 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49365600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): ~60+ // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 50099900 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49445500 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 49783800 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48533400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 48966600 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47564700 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47087400 [us] -- result = 34473.4 @ 8000 loops.
[LOOP] norm_reduce(): 47624300 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): ~60+ [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP] norm_reduce(): 46887700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46571800 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46794700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46862600 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 47348700 [us] -- result = 34473.4 @ 8000 loops. [--fast]
[LOOP] norm_reduce(): 46669500 [us] -- result = 34473.4 @ 8000 loops. [--fast]

A third surprise came from going into a forall do { ... }:

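While the [SEQ]-nloops-ed code was heavily devastated by the associated add-on overheads, a slightly re-formulated problem showed that very different performance levels are achievable even on a single-CPU platform (the gain should be even larger for multi-CPU code execution). The effect of the --fast compiler switch is also visible here: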

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_LOOP: Timer;

config const nloops = 100000000;  // 1E+8

var res: atomic real;
    res.write( 0.0 );
//------------------------------------------------------------------// PRE-COMPUTE:
var A1: [1 .. nloops] real;       // pre-compute a tuple-element value
forall k in 1 .. nloops do        // pre-compute a tuple-element value
    A1[k] = (k % 5): real;        // pre-compute a tuple-element value to a ( k % 5 ), ex-post typecast to real

/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.start();
forall i in 1 .. nloops do
{   // a[1] = ( i % 5 ): real;                                 // pre-compute'd
    res.add( norm_reduce( ( A1[i], a[1], a[2] ) ) );           // atomic.add()
 // res +=   norm_reduce( ( ( i % 5 ): real, a[1], a[2] ) );   // non-atomic
 //:49: note: The shadow variable 'res' is constant due to forall intents in this loop
}
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.stop();
write( "forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: ", aStopWATCH_LOOP.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
/*
--------------------{-nloops-}-------{--fast}--------------------
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 7911.0 [us] -- result = 320.196 @ 100 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8055.0 [us] -- result = 3201.96 @ 1000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8002.0 [us] -- result = 32019.6 @ 10000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 80685.0 [us] -- result = 3.20196e+05 @ 100000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 842948 [us] -- result = 3.20196e+06 @ 1000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 8005300 [us] -- result = 3.20196e+07 @ 10000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40358900 [us] -- result = 1.60098e+08 @ 50000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40671200 [us] -- result = 1.60098e+08 @ 50000000 loops.
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 2195000 [us] -- result = 1.60098e+08 @ 50000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4518790 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 6178440 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4755940 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4405480 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4509170 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4736110 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4653610 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4397990 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 4655240 [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
*/
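The compiler note quoted in the code above (the shadow variable 'res' being constant inside the forall) is what pushes the benchmark towards an atomic accumulator. A possible alternative, shown here only as a sketch and not as something measured in this answer, is a reduce intent on the forall loop: each task then accumulates into its own private copy of res and the copies are combined with + when the loop finishes. This assumes a Chapel release that supports reduce intents on forall loops, uses the serial norm_3tuple() from the question in place of norm_reduce(), and picks the nloops default and the constant tuple elements 2.0 and 3.0 purely for illustration.

```chapel
config const nloops = 1000000;   // assumed default for illustration

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

var res = 0.0;

// Each task sums into a private shadow copy of `res`;
// the per-task partial sums are added together at the end of the loop.
forall i in 1 .. nloops with ( + reduce res ) do
    res += norm_3tuple( ( (i % 5): real, 2.0, 3.0 ) );

writeln( "result = ", res, " @ ", nloops, " loops." );
```

Whether this beats the atomic version on a given machine would have to be benchmarked in the same way as the runs above.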
