CyclicDist 在多个区域设置上变慢
CyclicDist goes slower on multiple locales
我尝试使用 CyclicDist
模块实现矩阵乘法。
当我用一种语言环境与两种语言环境进行测试时,一种语言环境要快得多。是因为两个 Jetson nano 板之间的通信时间真的很长还是我的实现没有利用 CyclicDist
的工作方式?
这是我的代码:
use Random, Time, CyclicDist;
var t : Timer;
t.start();
config const size = 10;
const Space = {1..size, 1..size};
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);
const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);
const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace] real;
forall i in 1..size do {
forall j in 1..size do {
forall k in 1..size do {
grid3[i,j] += grid[i,k] * grid2[k,j];
}
}
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear()
我知道我的实现不是分布式内存系统的最佳选择。
Q : Is it because the time to communicate (1) between the two Jetson nano boards is really big or is my implementation (2) not taking advantage of the way CyclicDist
works?
第二个选项是肯定的选择:~ 100 x
更差 性能是在 CyclicDist
小尺寸数据上实现的。
Documentation明确warns关于这个,说:
Cyclic distribution maps indices to locales in a round-robin pattern starting at a given index.
...
Limitations
This distribution has not been tuned for performance.
对处理效率的不利影响在单一语言环境平台上是显而易见的,其中所有数据都驻留在语言环境本地内存中 space,因此从未增加任何 NUMA 板间通信附加成本。与 Vass' single-forall{}
D3
迭代求和积[=相比,仍然 ~ 100 x
更差 35=]
(直到现在才注意到 Vass 的性能驱动变化从最初的 forall-in-D3-do-{}
到另一个配置的 forall-in-D2-do-for{}
-tandem-iterated 修订版 - 到目前为止,小尺寸 - fast --ccflags -O3 performed test show almost half of length WORSE forall-in-D2-do-for{}
-iterator-in-iterator 结果的性能,甚至比 O/P 三元组更差-forall{}
原始提案,除了尺寸低于 512x512 和经过 -O3 优化后,但对于最小尺寸 128x128
最高性能达到了 ~ 850 [ns]
per原始 Vass-D3 单迭代器的单元格,令人惊讶的是没有 --ccflags -O3(对于正在处理的更大 --size={ 1024 | 2048 | 4096 | 8192 }
数据布局,可能会明显改变,如果更宽的 NUMA 多区域设置和更高的并行设备正在参加比赛 ) )
TiO.run platform uses 1 numLocales,
having 2 physical CPU-cores accessible (numPU-s)
with 2 maxTaskPar parallelism limit
CyclicDist
的使用会影响 DATA 到内存的布局,不是吗?
通过对 小尺寸 --size={128 | 256 | 512 | 640}
的测量进行验证,有和没有轻微的 --ccflags -O3
影响
// --------------------------------------------------------------------------------------------------------------------------------
// --fast
// ------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 255818 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3075 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 3040 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2198 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1974 [us] excl. fillRandom()-ops <-- 127x SLOWER with CyclicDist dmapped DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2122 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 252439 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2141444 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 27095 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 25339 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 23493 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 21631 [us] excl. fillRandom()-ops <-- 98x SLOWER then w/o CyclicDist dmapped data
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 21971 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2122417 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16988685 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17448207 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 268111 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 270289 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 250896 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 239898 [us] excl. fillRandom()-ops <-- 71x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 257479 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17391049 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16932503 [us] excl. fillRandom()-ops <~~ ~2e5 [us] faster without --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35136377 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 362205 [us] incl. fillRandom()-ops <-- 97x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 367651 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345865 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 337896 [us] excl. fillRandom()-ops <-- 103x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 351101 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35052849 [us] excl. fillRandom()-ops <~~ ~3e4 [us] faster without --ccflags -O3
//
// --------------------------------------------------------------------------------------------------------------------------------
// --fast --ccflags -O3
// --------------------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 250372 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3189 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2966 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2284 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1949 [us] excl. fillRandom()-ops <-- 126x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2072 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 246965 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2114615 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 37775 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 38866 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 32384 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 29264 [us] excl. fillRandom()-ops <-- 71x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 33973 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2098344 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17136826 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081273 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 251786 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 266766 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 239301 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 233003 [us] excl. fillRandom()-ops <~~ ~6e3 [us] faster with --ccflags -O3
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 253642 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17025339 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081352 [us] excl. fillRandom()-ops <~~ ~2e5 [us] slower with --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35164630 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 363060 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 489529 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345742 [us] excl. fillRandom()-ops <-- 104x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 353353 [us] excl. fillRandom()-ops <-- 102x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 471213 [us] excl. fillRandom()-ops <~~~12e5 [us] slower with --ccflags -O3
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35075435 [us] excl. fillRandom()-ops
无论如何,Chapel 团队的见解(设计方面和测试方面)都很重要。 @Brad 被要求提供类似的测试覆盖和比较主要更大尺寸 --size={1024 | 2048 | 4096 | 8192 | ...}
和具有多语言环境和多语言环境解决方案的 "way-wider"-NUMA 平台,可在 Cray 获得对于 Chapel 团队的研发,不会受到硬件和 public、赞助、共享 TiO.RUN 平台上的 ~ 60 [s]
限制的影响。
这个程序没有扩展的主要原因可能是计算从不使用除初始语言环境之外的任何语言环境。具体来说,forall 在范围内循环,就像您的代码中的那样:
forall i in 1..size do
总是运行 他们的所有迭代都使用在当前区域设置上执行的任务。这是因为范围不是 Chapel 中的分布式值,因此它们的并行迭代器不会跨区域分配工作。结果,循环体的所有 size**3 次执行:
grid3[i,j] += grid[i,k] * grid2[k,j];
将在区域设置 0 上 运行,其中 none 将在区域设置 1 上 运行。您可以通过将以下内容放入最内层循环的主体中来了解情况:
writeln("locale ", here.id, " running ", (i,j,k));
(其中 here.id
打印出当前任务所在区域设置的 ID 运行ning)。这将显示区域设置 0 是 运行ning 所有迭代:
0 running (9, 1, 1)
0 running (1, 1, 1)
0 running (1, 1, 2)
0 running (9, 1, 2)
0 running (1, 1, 3)
0 running (9, 1, 3)
0 running (1, 1, 4)
0 running (1, 1, 5)
0 running (1, 1, 6)
0 running (1, 1, 7)
0 running (1, 1, 8)
0 running (1, 1, 9)
0 running (6, 1, 1)
...
将此与 运行 在分布式域上使用 forall 循环进行对比,例如 gridSpace
:
forall (i,j) in gridSpace do
writeln("locale ", here.id, " running ", (i,j));
迭代将在语言环境之间分布:
locale 0 running (1, 1)
locale 0 running (9, 1)
locale 0 running (1, 2)
locale 0 running (9, 2)
locale 0 running (1, 3)
locale 0 running (9, 3)
locale 0 running (1, 4)
locale 1 running (8, 1)
locale 1 running (10, 1)
locale 1 running (8, 2)
locale 1 running (2, 1)
locale 1 running (8, 3)
locale 1 running (10, 2)
...
由于所有计算都在区域设置 0 上进行 运行,但一半数据位于区域设置 1 上(由于数组是分布式的),因此会生成大量通信以从区域设置中获取远程值1 的内存到区域设置 0 以便对其进行计算。
我尝试使用 CyclicDist
模块实现矩阵乘法。
当我用一种语言环境与两种语言环境进行测试时,一种语言环境要快得多。是因为两个 Jetson nano 板之间的通信时间真的很长还是我的实现没有利用 CyclicDist
的工作方式?
这是我的代码:
use Random, Time, CyclicDist;
var t : Timer;
t.start();
config const size = 10;
const Space = {1..size, 1..size};
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);
const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);
const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace] real;
forall i in 1..size do {
forall j in 1..size do {
forall k in 1..size do {
grid3[i,j] += grid[i,k] * grid2[k,j];
}
}
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear()
我知道我的实现不是分布式内存系统的最佳选择。
Q : Is it because the time to communicate (1) between the two Jetson nano boards is really big or is my implementation (2) not taking advantage of the way
CyclicDist
works?
第二个选项是肯定的选择:~ 100 x
更差 性能是在 CyclicDist
小尺寸数据上实现的。
Documentation明确warns关于这个,说:
Cyclic distribution maps indices to locales in a round-robin pattern starting at a given index.
...
Limitations
This distribution has not been tuned for performance.
对处理效率的不利影响在单一语言环境平台上是显而易见的,其中所有数据都驻留在语言环境本地内存中 space,因此从未增加任何 NUMA 板间通信附加成本。与 Vass' single-forall{}
D3
迭代求和积[=相比,仍然 ~ 100 x
更差 35=]
(直到现在才注意到 Vass 的性能驱动变化从最初的 forall-in-D3-do-{}
到另一个配置的 forall-in-D2-do-for{}
-tandem-iterated 修订版 - 到目前为止,小尺寸 - fast --ccflags -O3 performed test show almost half of length WORSE forall-in-D2-do-for{}
-iterator-in-iterator 结果的性能,甚至比 O/P 三元组更差-forall{}
原始提案,除了尺寸低于 512x512 和经过 -O3 优化后,但对于最小尺寸 128x128
最高性能达到了 ~ 850 [ns]
per原始 Vass-D3 单迭代器的单元格,令人惊讶的是没有 --ccflags -O3(对于正在处理的更大 --size={ 1024 | 2048 | 4096 | 8192 }
数据布局,可能会明显改变,如果更宽的 NUMA 多区域设置和更高的并行设备正在参加比赛 ) )
TiO.run platform uses 1 numLocales,
having 2 physical CPU-cores accessible (numPU-s)
with 2 maxTaskPar parallelism limit
CyclicDist
的使用会影响 DATA 到内存的布局,不是吗?
通过对 小尺寸 --size={128 | 256 | 512 | 640}
的测量进行验证,有和没有轻微的 --ccflags -O3
影响
// --------------------------------------------------------------------------------------------------------------------------------
// --fast
// ------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 255818 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3075 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 3040 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2198 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1974 [us] excl. fillRandom()-ops <-- 127x SLOWER with CyclicDist dmapped DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2122 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 252439 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2141444 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 27095 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 25339 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 23493 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 21631 [us] excl. fillRandom()-ops <-- 98x SLOWER then w/o CyclicDist dmapped data
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 21971 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2122417 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16988685 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17448207 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 268111 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 270289 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 250896 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 239898 [us] excl. fillRandom()-ops <-- 71x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 257479 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17391049 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16932503 [us] excl. fillRandom()-ops <~~ ~2e5 [us] faster without --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35136377 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 362205 [us] incl. fillRandom()-ops <-- 97x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 367651 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345865 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 337896 [us] excl. fillRandom()-ops <-- 103x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 351101 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35052849 [us] excl. fillRandom()-ops <~~ ~3e4 [us] faster without --ccflags -O3
//
// --------------------------------------------------------------------------------------------------------------------------------
// --fast --ccflags -O3
// --------------------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 250372 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3189 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2966 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2284 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1949 [us] excl. fillRandom()-ops <-- 126x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2072 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 246965 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2114615 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 37775 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 38866 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 32384 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 29264 [us] excl. fillRandom()-ops <-- 71x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 33973 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2098344 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17136826 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081273 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 251786 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 266766 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 239301 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 233003 [us] excl. fillRandom()-ops <~~ ~6e3 [us] faster with --ccflags -O3
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 253642 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17025339 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081352 [us] excl. fillRandom()-ops <~~ ~2e5 [us] slower with --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35164630 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 363060 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 489529 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345742 [us] excl. fillRandom()-ops <-- 104x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 353353 [us] excl. fillRandom()-ops <-- 102x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 471213 [us] excl. fillRandom()-ops <~~~12e5 [us] slower with --ccflags -O3
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35075435 [us] excl. fillRandom()-ops
无论如何,Chapel 团队的见解(设计方面和测试方面)都很重要。 @Brad 被要求提供类似的测试覆盖和比较主要更大尺寸 --size={1024 | 2048 | 4096 | 8192 | ...}
和具有多语言环境和多语言环境解决方案的 "way-wider"-NUMA 平台,可在 Cray 获得对于 Chapel 团队的研发,不会受到硬件和 public、赞助、共享 TiO.RUN 平台上的 ~ 60 [s]
限制的影响。
这个程序没有扩展的主要原因可能是计算从不使用除初始语言环境之外的任何语言环境。具体来说,forall 在范围内循环,就像您的代码中的那样:
forall i in 1..size do
总是运行 他们的所有迭代都使用在当前区域设置上执行的任务。这是因为范围不是 Chapel 中的分布式值,因此它们的并行迭代器不会跨区域分配工作。结果,循环体的所有 size**3 次执行:
grid3[i,j] += grid[i,k] * grid2[k,j];
将在区域设置 0 上 运行,其中 none 将在区域设置 1 上 运行。您可以通过将以下内容放入最内层循环的主体中来了解情况:
writeln("locale ", here.id, " running ", (i,j,k));
(其中 here.id
打印出当前任务所在区域设置的 ID 运行ning)。这将显示区域设置 0 是 运行ning 所有迭代:
0 running (9, 1, 1)
0 running (1, 1, 1)
0 running (1, 1, 2)
0 running (9, 1, 2)
0 running (1, 1, 3)
0 running (9, 1, 3)
0 running (1, 1, 4)
0 running (1, 1, 5)
0 running (1, 1, 6)
0 running (1, 1, 7)
0 running (1, 1, 8)
0 running (1, 1, 9)
0 running (6, 1, 1)
...
将此与 运行 在分布式域上使用 forall 循环进行对比,例如 gridSpace
:
forall (i,j) in gridSpace do
writeln("locale ", here.id, " running ", (i,j));
迭代将在语言环境之间分布:
locale 0 running (1, 1)
locale 0 running (9, 1)
locale 0 running (1, 2)
locale 0 running (9, 2)
locale 0 running (1, 3)
locale 0 running (9, 3)
locale 0 running (1, 4)
locale 1 running (8, 1)
locale 1 running (10, 1)
locale 1 running (8, 2)
locale 1 running (2, 1)
locale 1 running (8, 3)
locale 1 running (10, 2)
...
由于所有计算都在区域设置 0 上进行 运行,但一半数据位于区域设置 1 上(由于数组是分布式的),因此会生成大量通信以从区域设置中获取远程值1 的内存到区域设置 0 以便对其进行计算。