_mm512_storenr_pd 和 _mm512_storenrngo_pd

_mm512_storenr_pd and _mm512_storenrngo_pd

_mm512_storenrngo_pd and _mm512_storenr_pd有什么区别?

_mm512_storenr_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint to the processor.

我不清楚未读提示是什么意思。这是否意味着它是非缓存一致性写入。这是否意味着重用更昂贵或不连贯?

_mm512_storenrngo_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint and using a weakly-ordered memory consistency model (stores performed with this function are not globally ordered, and subsequent stores from the same thread can be observed before them).

storenr_pd基本相同,但由于它使用弱一致性模型,这意味着一个进程可以在任何其他处理器之前查看自己的写入。但是另一个处理器的访问是不一致的还是更昂贵?

引自Intel® Xeon Phi™ Coprocessor Vector Microarchitecture

In general, in order to write to a cache line, the Xeon Phi™ coprocessor needs to read in a cache line before writing to it. This is known as read for ownership (RFO). One problem with this implementation is that the written data is not reused; we unnecessarily take up the BW for reading non-temporal data. The Intel® Xeon Phi™ coprocessor supports instructions that do not read in data if the data is a streaming store. These instructions, VMOVNRAP*, VMOVNRNGOAP* allow one to indicate that the data needs to be written without reading the data first. In the Xeon Phi ISA the VMOVNRAPS/VMOVNRPD instructions are able to optimize the memory BW in case of a cache miss by not going through the unnecessary read step.

The VMOVNRNGOAP* instructions are useful when the programmer tolerates weak write-ordering of the application data―that is, the stores performed by these instructions are not globally ordered. This means that the subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.

似乎“未读提示”、“流式存储”和“非时态Stream/Store”在多个资源中可互换使用。

所以是的,它是非缓存一致写入,尽管在 Knights Corner(KNC,vmovnrap* 和 vmovnrngoap* 都属于 KNC)中,存储恰好发生在 L2 缓存中,它不会绕过所有级别的缓存。

如上文所述,vmovnrngoap*vmovnrap* 不同,弱序内存一致性模型允许“在执行 VMOVNRNGOAP 指令之前可以观察到同一线程的后续写入",所以是的,另一个线程或处理器的访问是不一致的,应该使用隔离操作。尽管 CPUID 可以用作防护操作,但更好的选择是 "LOCK ADD [RSP],0"(虚拟原子添加)或 XCHG(结合了存储和防护)。

更多细节:

NR Stores.The NR store instruction (vmovnr) is a standard vector store instruction that can always be used safely. An NR store instruction that misses in the local cache causes all potential copies of the cache line in remote caches to be invalidated, the cache line to be allocated (but not initialized) at the local cache in exclusive state, and the write-data in the instruction to be written to the cacheline. There is no data transfer from main memory which is what saves memory bandwidth. An NR store instruction and other load and/or store instructions from the same thread are globally ordered, which means that all observers of this sequence of instructions always see the same fixed execution order.

The NR.NGO (non-globally ordered) store instruction(vmovnrngo) relaxes the global ordering constraint of the NR store instruction.This relaxation makes the NR.NGO instruction have a lower latency than the NRinstruction, which can be used to achieve higher performance in streaming storeintensive applications. However, removing this restriction means that an NR.NGO store instruction and other load and/or store instructions from the same thread can be observed by two observers to have two different orderings. The use of NR.NGO store instructions is safe only when reordering the order of these instructions is verified not to change the outcome. Otherwise, using NR.NGO stores may lead to incorrect execution. Our compiler can generate NR.NGO store instructions for store instructions that it identifies to have non-temporal behavior. For instance, a parallel loop that is detected to be non-temporal by our compiler can make use of NR.NGO instructions. At the end of such a loop, to ensure all outstanding non-globally ordered stores are completed and all threads have a consistent view of memory, our compiler generates a fence (a lock instruction) after the loop. This fence is needed before continuing execution of the subsequent code fragment to ensure all threads have exactly the same view of memory.

一般的经验法则是,非临时存储有益于近期不会重用的内存访问块。因此,是的,在这两种情况下重用都会很昂贵。