有没有办法在 Chapel 中自定义整个数组语句的默认并行化行为？

Question

根据 Chapel 的可用文档，（整个）数组像

这样的语句

A = B + alpha * C;   // with A, B, and C being arrays, and alpha some scalar

在所有迭代中以如下语言实现：

forall (a,b,c) in zip(A,B,C) do
   a = b + alpha * c;

因此，数组语句默认由一组并行执行线程。不幸的是，这似乎也完全排除了此类语句的（部分或完整）向量化。这个对于习惯于 Fortran 或 Python/Numpy 等语言（默认行为通常是仅对数组语句进行矢量化）的程序员来说，这可能会导致性能意外。

对于使用（整个）数组语句的代码，数组小到大小适中，向量化损失（由 Linux 硬件确认性能计数器）和固有的显着开销并行线程（不适合有效地利用此类问题中可用的细粒度数据并行性）可能会导致在显着的性能损失。作为一个例子，考虑以下版本的 Jacobi 迭代都解决了相同的问题在 300 x 300 个区域的域上：

Jacobi_1使用数组语句，如下：

/*
 *  Jacobi_1
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (whole-)array statements.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  X[n+1,1..n] = 1.0;                 // Set south boundary values to 1.0

  do {

    // compute next approximation
    XNew[1..n,1..n] =
      ( X[0..n-1,1..n] + X[2..n+1,1..n] +
        X[1..n,2..n+1] + X[1..n,0..n-1] ) / 4.0;

    // update X with next approximation
    X[1..n,1..n] = XNew[1..n,1..n];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Jacobi_2 在整个过程中使用串行 for 循环（即仅（自动）矢量化允许后端 C 编译器):

/*
 *  Jacobi_2
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (serial) for-loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in 1..n do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    for i in 1..n do
      for j in 1..n do  
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    for i in 1..n do
      for j in 1..n do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Jacobi_3，最后，对最内层的循环进行了向量化，并且只有最外层循环线程：

/*
 *  Jacobi_3
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using both parallel and serial (vectorized) loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in vectorizeOnly(1..n) do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

运行这些代码在具有 2 个处理器内核并使用两个并行线程，人们发现 Jacobi_1 是（令人惊讶的）比 Jacobi_2 慢十倍以上，它本身是（预期的）比 Jacobi_3.

慢 ~1.6 倍

不幸的是，这种默认行为使得数组语句完全对于我的用例没有吸引力，即使对于有益的算法也是如此极大地来自更简洁的符号和可读性 (whole-)array语句可以提供。

Chapel 中的用户是否有办法更改此默认行为？也就是说，用户可以自定义整个数组的默认并行化吗？ Jacobi_1 中使用的数组语句将表现得像 Jacobi_2 中的代码（这对代码开发和调试很有用），或 Jacobi_3 中的代码（在这三者中，它是生产计算的首选方法）？

我试图通过将对“vectorizeOnly()”的调用插入到上面 "Domain" 的定义，但没有用。

Answer 1

Chapel 的意图是 to support vectorization automatically within the per-task serial loops that are used to implement forall loops (for cases that are legally vectorizable). Yet that capability is not well-supported today, as you note (even the vectorizeOnly() iterator that you're using is only considered prototypical)。

我会提到，与使用（默认）C 后端相比，使用 Chapel 的 LLVM 后端我们往往会看到更好的矢量化结果，而且在使用 Simon Moll 的基于 LLVM 的区域向量器（萨尔大学）。但我们也看到过 LLVM 后端性能不如 C 后端的情况，因此您的情况可能会有所不同。但如果你关心矢量化，那值得一试。

针对您的具体问题：

Are there ways for the user in Chapel to change this default behavior?

有。对于显式 forall 循环，您可以 write your own parallel iterator which can be used to specify a different implementation strategy for a forall loop than our default iterators use. If you implement one that you like, you can then write (or clone and modify) a domain map (background here) 来控制默认情况下如何实现给定数组上的循环（即，如果没有显式调用迭代器）。这允许最终用户为 Chapel 阵列指定不同于我们默认支持的实现策略。

关于您的三个代码变体，我注意到第一个使用多维压缩，这在当今已知存在严重的性能问题。这可能是它与其他产品之间性能差异的主要原因。例如，我怀疑如果您使用 forall (i,j) in Domain ... 形式重写它，然后使用每个维度的 +/-1 索引，您会看到显着的改进（而且，我猜，性能更具可比性到第三种情况）。

对于第三个，我想知道您看到的好处是由于矢量化还是仅仅由于多任务处理，因为您已经避免了第一个的性能问题和第二个的串行实现。例如，您是否检查过使用 vectorizeOnly() 迭代器是否比没有该迭代器的相同代码增加了任何性能改进（或使用二进制文件上的工具来检查是否正在发生矢量化？）

在任何 Chapel 性能研究中，确保抛出 --fast 编译器标志。同样，为了获得最佳矢量化结果，您可以尝试 LLVM 后端。

有没有办法在 Chapel 中自定义整个数组语句的默认并行化行为？

Is there a way to customize the default parallelization behavior of whole-array statements in Chapel?

arrays

parallel-processing

chapel