Branch and predicated instructions
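Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, by "predicated instructions". I don't understand the difference between the two, or why one leads to better performance than the other.
This comment suggests that branch instructions lead to a greater number of executed instructions, to stalls due to "branch address resolution and fetch", and to overhead due to "the branch itself" and "book keeping for divergence", while predicated instructions incur only the "instruction execution latency to do the condition test and set the predicate". Why?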
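Instruction predication means that a thread executes an instruction conditionally, depending on a predicate. Threads for which the predicate is true execute the instruction; the rest do nothing.
For example: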
var = 0;
// Not taken by all threads
if (condition) {
var = 1;
} else {
var = 2;
}
output = var;
would result in something like this (not actual compiler output):
mov.s32 var, 0; // Executed by all threads.
setp pred, condition; // Executed by all threads, sets predicate.
@pred mov.s32 var, 1; // Executed only by threads where pred is true.
@!pred mov.s32 var, 2; // Executed only by threads where pred is false.
mov.s32 output, var; // Executed by all threads.
All in all, that is 3 instructions for the if (one setp and two predicated mov) and no branching. Very efficient.
The equivalent code with branches would look like this:
mov.s32 var, 0; // Executed by all threads.
setp pred, condition; // Executed by all threads, sets predicate.
@!pred bra IF_FALSE; // Conditional branches are predicated instructions.
IF_TRUE: // Label for clarity, not actually used.
mov.s32 var, 1;
bra IF_END;
IF_FALSE:
mov.s32 var, 2;
IF_END:
mov.s32 output, var;
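Notice how much longer it is (5 instructions for the if). A divergent conditional branch requires disabling part of the warp, executing the first path, then going back to the point where the warp diverged and executing the second path, until both paths converge. That takes longer, requires extra bookkeeping, and loads more code (particularly when there are many instructions to execute), which means more memory requests. All of that makes branching slower than simple predication for a short conditional like this one.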
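The difference between the two instruction streams can be made concrete with a toy model in plain C. This is only a sketch under loud assumptions: it treats a warp as 32 lanes, counts one issue slot per instruction the warp steps through, and ignores latency, fetch, and any real scheduling detail. The names `WarpResult`, `run_predicated`, `run_branched`, and `check_equivalent` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32

/* Hypothetical toy model, not real hardware behavior: "issued" counts the
   instructions the warp steps through for the if/else alone. */
typedef struct {
    int var[WARP_SIZE]; /* per-lane value of var after the if/else */
    int issued;         /* warp-level instructions issued */
} WarpResult;

/* Predicated form: every instruction is issued once for the whole warp;
   lanes whose predicate does not match simply do nothing. */
WarpResult run_predicated(const bool cond[WARP_SIZE]) {
    WarpResult r = { .issued = 0 };  /* var starts at 0 in every lane */
    bool pred[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; ++lane) pred[lane] = cond[lane];
    r.issued++;                      /* setp pred, condition */
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        if (pred[lane]) r.var[lane] = 1;
    r.issued++;                      /* @pred mov.s32 var, 1 */
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        if (!pred[lane]) r.var[lane] = 2;
    r.issued++;                      /* @!pred mov.s32 var, 2 */
    return r;
}

/* Branching form: the warp walks each path that at least one lane takes,
   with the other lanes disabled while it does so. */
WarpResult run_branched(const bool cond[WARP_SIZE]) {
    WarpResult r = { .issued = 0 };
    bool pred[WARP_SIZE];
    int n_true = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        pred[lane] = cond[lane];
        if (pred[lane]) n_true++;
    }
    r.issued++;                      /* setp pred, condition */
    r.issued++;                      /* @!pred bra IF_FALSE */
    if (n_true > 0) {                /* IF_TRUE path */
        for (int lane = 0; lane < WARP_SIZE; ++lane)
            if (pred[lane]) r.var[lane] = 1;
        r.issued++;                  /* mov.s32 var, 1 */
        r.issued++;                  /* bra IF_END */
    }
    if (n_true < WARP_SIZE) {        /* IF_FALSE path */
        for (int lane = 0; lane < WARP_SIZE; ++lane)
            if (!pred[lane]) r.var[lane] = 2;
        r.issued++;                  /* mov.s32 var, 2 */
    }
    return r;
}

/* Runs both forms on one warp and checks that they agree lane by lane. */
void check_equivalent(const bool cond[WARP_SIZE]) {
    WarpResult p = run_predicated(cond);
    WarpResult b = run_branched(cond);
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        assert(p.var[lane] == b.var[lane]);
}
```

With a condition that alternates between lanes, this model reports 3 issued instructions for the predicated form and 5 for the branched form, matching the counts above, while both forms leave the same value in every lane. It says nothing about the bookkeeping and instruction-fetch costs, which come on top of the raw instruction count.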
Actually, for this very simple conditional assignment, the compiler can do even better, with only 2 instructions for the if:
mov.s32 var, 0; // Executed by all threads.
setp pred, condition; // Executed by all threads, sets predicate.
selp var, 1, 2, pred; // Sets var depending on predicate (true: 1, false: 2).
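At the source level, the select form corresponds to a conditional expression rather than an if/else. The following plain C sketch models that; `selp_s32` and `select_var` are hypothetical names chosen for illustration, and whether a compiler actually emits a select instruction for such code is its own decision, not something this snippet guarantees.

```c
#include <stdbool.h>

/* Models "selp d, a, b, p": the result is a when p is true, b otherwise. */
int selp_s32(int a, int b, bool p) {
    return p ? a : b;
}

/* The if/else from the example collapsed into a single selection. */
int select_var(bool condition) {
    return selp_s32(1, 2, condition);
}
```

Both candidate values are available before the selection, so no lane needs to be disabled and no path needs to be revisited.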