CUDA独立线程调度

Question

Q1: 编程指南 v11.6.0 声明以下代码模式在 Volta 和更高版本的 GPU 上有效：

if (tid % warpSize < 16) {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
} else {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
}

为什么会这样？

假设先执行if分支，当线程0~15命中__shfl_xor_sync语句时，它们变为不活动状态，线程16~31开始执行指令，直到它们命中相同的语句，其中上半场和下半场扭曲交换 val。我的理解正确吗？

如果是这样，编程指南还指出“如果目标线程处于非活动状态，则检索到的值是未定义的”并且“线程可能由于多种原因而处于非活动状态，包括......采取了不同的分支路径比曲速当前执行的分支路径。”这是否意味着 if 和 else 分支都将获得未定义的值？

Q2:在当前执行独立线程调度（Volta~Ampere）的GPU上，执行if分支时，非活动线程是否仍在执行NOOP？也就是说，我是否仍应将 warp 执行视为步调一致？

Q3：同步（如__shfl_sync、__ballot_sync）是语句交错的唯一原因（语句A和B来自if 分支与来自 else 分支的 X 和 Y 交错）？我很好奇当前的 ITS 与 subwarp interleaving.

有何不同

Answer 1

Q1:

Why so?

这是个特例。编程指南没有给出 __shfl_sync() 的详细行为的完整描述来理解这种情况（据我所知），尽管编程指南中给出的陈述是正确的。要获得指令的详细行为描述，我建议查看 PTX guide:

shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.

仔细研究该陈述可能足以理解。但是我们可以稍微解压它。

如前所述，这不适用于低于 7.0 的计算能力。对于那些计算能力，成员掩码中命名的所有线程都必须参与 code/instruction 的确切行，并且要使任何 warp lane 的结果有效，必须命名 source lane在成员掩码中，并且不得由于该代码行的强制分歧而被排除在参与之外
我会在 cc7.0+ 的情况下将 __shfl_sync() 描述为“异常”，因为它会导致 partial-warp 执行在指令的那个点暂停，然后 control/scheduling 会被给予其他经线碎片。那些其他的 warp 片段将被允许继续（由于 Volta ITS），直到成员掩码中命名的所有线程都到达“匹配”的 __shfl_sync() 语句，即具有相同的成员掩码和限定符。然后执行 shuffle 语句。因此，尽管此时强制发散，__shfl_sync() 操作的行为就好像扭曲在该点充分收敛以匹配成员掩码。

我会将其描述为“异常”或“特殊”行为。

If so, the programming guide also states that "if the target thread is inactive, the retrieved value is undefined" and that "threads can be inactive for a variety of reasons including ... having taken a different branch path than the branch path currently executed by the warp."

在我看来，“如果目标线程处于非活动状态，则检索到的值是未定义的”语句最直接适用于小于 7.0 的计算能力。它还适用于计算能力 7.0+ 如果其他地方没有 corresponding/matching shuffle 语句，线程调度程序可以使用它来创建适当的 warp-wide shuffle op。提供的代码示例仅给出合理的结果，因为在 if 部分和 else 部分中都有匹配的操作。如果我们将 else 部分设为空语句，代码将不会为 warp 中的任何线程提供有趣的结果。

Q2:

On GPUs with current implementation of independent thread scheduling (Volta~Ampere), when the if branch is executed, are inactive threads still doing NOOP? That is, should I still think of warp execution as lockstep?

如果我们考虑一般情况，我建议考虑非活动线程的方式是它们是非活动的。如果你愿意，你可以称之为 NOOP。由于强制发散（在我看来），此时的 Warp 执行并不是在整个 Warp 中“步调一致”。我不想在这里争论语义。如果您觉得描述准确，那就是“假设某些线程正在执行指令，而另一些线程则没有执行锁步执行”，那没关系。然而，我们现在已经看到，对于 shuffle sync 操作的特定情况，Volta+ 线程调度程序围绕强制发散工作，结合来自不同执行路径的操作，以满足对该特定指令的期望。

Q3:

Is synchronization (such as __shfl_sync, __ballot_sync) the only cause for statement interleaving (statements A and B from the if branch interleaved with X and Y from the else branch)?

我不这么认为。任何时候你有一个条件 if-else 结构导致除法 intra-warp，你就有可能交错。 I define Volta+ interleaving（图 12）作为一个扭曲片段的向前推进，随后是另一个扭曲片段的向前推进，可能在重新会聚之前继续交替。这种来回交替的能力不仅仅适用于同步操作。原子可以这样处理（这是 Volta ITS 模型的一个特殊 use-case - 例如在 producer/consumer 算法中使用或用于 intra-warp 锁协商 - 称为“无饥饿”在之前链接的文章中），我们还可以想象一个 warp 片段可能由于多种原因（例如数据依赖性，可能是由于加载指令）而停止，这会“暂时”阻止该 warp 片段的前进。我相信 Volta ITS 可以通过从一个 warp 片段到另一个 warp 片段的交替前向进度调度来处理各种可能的延迟。这个想法在引言中的论文中有介绍（“load-to-use”）。抱歉，我无法在此处提供对该论文的扩展讨论。

编辑： 回应评论中的一个问题，解释为“在什么情况下，调度程序可以使用后续的随机播放操作来满足正在等待的扭曲片段的需要洗牌操作完成了吗？

首先，让我们注意到上面的 PTX 描述暗示了某种同步。调度程序已停止执行遇到混洗操作的 warp 片段，等待其他 warp 片段参与（以某种方式）。这是对同步的描述。

其次，PTX 描述允许退出线程。

这一切意味着什么？最简单的描述就是后续的“匹配”shuffle op can/will be &qut;found by the scheduler”，如果可能的话，满足洗牌操作。让我们考虑一些例子。

测试用例 1：如编程指南中所述，我们看到预期结果：

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 0, swp: 16.000000
thread: 1, swp: 17.000000
thread: 2, swp: 18.000000
thread: 3, swp: 19.000000
thread: 4, swp: 20.000000
thread: 5, swp: 21.000000
thread: 6, swp: 22.000000
thread: 7, swp: 23.000000
thread: 8, swp: 24.000000
thread: 9, swp: 25.000000
thread: 10, swp: 26.000000
thread: 11, swp: 27.000000
thread: 12, swp: 28.000000
thread: 13, swp: 29.000000
thread: 14, swp: 30.000000
thread: 15, swp: 31.000000
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
$

测试用例2：删除else子句的主体。这仍然“有效”，因为允许退出的线程满足同步点，但结果与之前的情况完全不匹配。 None 的洗牌操作“成功”：

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 32.000000
thread: 17, swp: 32.000000
thread: 18, swp: 32.000000
thread: 19, swp: 32.000000
thread: 20, swp: 32.000000
thread: 21, swp: 32.000000
thread: 22, swp: 32.000000
thread: 23, swp: 32.000000
thread: 24, swp: 32.000000
thread: 25, swp: 32.000000
thread: 26, swp: 32.000000
thread: 27, swp: 32.000000
thread: 28, swp: 32.000000
thread: 29, swp: 32.000000
thread: 30, swp: 32.000000
thread: 31, swp: 32.000000
thread: 0, swp: 0.000000
thread: 1, swp: 0.000000
thread: 2, swp: 0.000000
thread: 3, swp: 0.000000
thread: 4, swp: 0.000000
thread: 5, swp: 0.000000
thread: 6, swp: 0.000000
thread: 7, swp: 0.000000
thread: 8, swp: 0.000000
thread: 9, swp: 0.000000
thread: 10, swp: 0.000000
thread: 11, swp: 0.000000
thread: 12, swp: 0.000000
thread: 13, swp: 0.000000
thread: 14, swp: 0.000000
thread: 15, swp: 0.000000
$

测试用例3：使用测试用例2，引入屏障，防止线程退出。现在我们看到 Volta 挂起。这是因为与 shuffle op 关联的同步点永远无法满足：

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    __syncwarp();
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
<hang>

测试用例4：从测试用例2开始，在条件区后面引入一个额外的shuffle op。在这种情况下，我们看到部分正确的结果。在条件区域中遇到混洗操作的 warp 片段的同步点显然由在条件区域外遇到混洗操作的剩余 warp 片段满足。然而，正如我们将看到的，对部分正确结果的解释是一个 warp 片段正在进行 2 次洗牌，另一个仅进行 1 次洗牌。进行两次洗牌的片段（下方片段）有第二个洗牌操作，其同步点得到满足由于存在线程条件，但其结果“不正确”，因为此时源通道未参与；他们已经退出：

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    swapped = __shfl_xor_sync(0xffffffff, val, 16);
    printf("thread: %d, swp: %f\n", tid, swapped);
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
thread: 0, swp: 0.000000
thread: 1, swp: 0.000000
thread: 2, swp: 0.000000
thread: 3, swp: 0.000000
thread: 4, swp: 0.000000
thread: 5, swp: 0.000000
thread: 6, swp: 0.000000
thread: 7, swp: 0.000000
thread: 8, swp: 0.000000
thread: 9, swp: 0.000000
thread: 10, swp: 0.000000
thread: 11, swp: 0.000000
thread: 12, swp: 0.000000
thread: 13, swp: 0.000000
thread: 14, swp: 0.000000
thread: 15, swp: 0.000000
$

测试用例5：从测试用例4开始，最后引入同步。我们再次观察到挂起。正在进行 2 个随机播放操作的 warp 片段（下部）没有满足其第二个随机播放操作同步点：

$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
//        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    swapped = __shfl_xor_sync(0xffffffff, val, 16);
    printf("thread: %d, swp: %f\n", tid, swapped);
    __syncwarp();
}

int main(){

    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
<hang>

此时挂起之前的部分打印输出是预期的。这是留给 reader 解释的练习：

为什么我们看到任何打印输出？
为什么会这样（只有上面的片段，但显然有正确的洗牌结果）？

CUDA独立线程调度

CUDA independent thread scheduling

cuda