Does __threadfence imply the effect of __syncthreads?

I am implementing a parallel reduction in CUDA.

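The kernel has a __syncthreads() that waits for all threads to finish their two reads from shared memory before the sum is written back to shared memory.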

Should I use __threadfence_block() to make sure the writes to shared memory are visible to all threads in the next iteration, or use __syncthreads() as given in NVIDIA's example?

__syncthreads() also implies memory-fence functionality. This is covered in the documentation:

waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

So in this case there is no need to use __threadfence_block() in addition to __syncthreads().

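You cannot replace the execution barrier in a typical parallel reduction with a threadfence function. In addition to the memory-fence functionality, the execution barrier (__syncthreads()) is required. In general, all threads must finish a given round of the reduction before any thread moves on to the next round; __threadfence_block() by itself does not force a warp to wait while other warps are still executing that round.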

So __syncthreads() is generally required, and if you use it correctly, __threadfence_block() is generally not needed.
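To make the point concrete, here is a minimal sketch of a shared-memory tree reduction that relies only on __syncthreads() between rounds. It is not the code from the question or from NVIDIA's sample; the names blockReduceSum, reduceOnGpu, and kBlockSize are hypothetical, the block size is assumed to be a power of two, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical block size for this sketch; assumed to be a power of two.
constexpr int kBlockSize = 256;

// Each block sums up to kBlockSize inputs and writes one partial sum.
__global__ void blockReduceSum(const float* in, float* out, unsigned int n)
{
    __shared__ float sdata[kBlockSize];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread; pad with zeros past the end of the input.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads are complete and visible before round one

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            // Two reads from shared memory, one write back.
            sdata[tid] += sdata[tid + s];
        }
        // Execution barrier plus block-level memory fence: no thread starts
        // the next round until every write of this round is visible.
        // __threadfence_block() alone would not make other warps wait here.
        __syncthreads();
    }

    if (tid == 0) {
        out[blockIdx.x] = sdata[0];
    }
}

// Host helper: launches the kernel and adds the per-block partial sums on the CPU.
float reduceOnGpu(const std::vector<float>& h_in)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) {
        return 0.0f;
    }
    unsigned int blocks = (n + kBlockSize - 1) / kBlockSize;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockReduceSum<<<blocks, kBlockSize>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}
```

The barrier sits outside the `if (tid < s)` branch so that every thread in the block reaches it; calling __syncthreads() inside a branch that only some threads take is not safe. Calling `reduceOnGpu` on a host vector returns the sum of its elements, up to floating-point rounding.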

__syncthreads() implies __threadfence_block().

__threadfence_block() does not imply __syncthreads().


parallel-processing

synchronization

cuda

reduction
