How to fetch a memory location in C++ without waiting for its retrieval?
Suppose we want to fetch a value from an array.
In some cases, we know the data is at data[i].
In other cases, we need to shift the index by some offset: i += offset_shifts[i]
i = ....
FETCH data[i]; // The result could be here if the offset_shifts is 0
// Or if i is very small (data[i] is in the same cache line & page)
i += offset_shifts[i]; // LLC cache miss (and most probably a TLB miss)
result = data[i]; // LLC cache miss (and most probably a TLB miss),
// unless it is obtained by an earlier FETCH
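The benefit I expect from this trick is that if offset_shifts[i] is small, there will be no TLB miss and no LLC cache miss, so both memory lookups can be completed for the cost of ONE (instead of two).
How can I fetch the value of data[i] from memory while, at the same time, fetching the value of offset_shifts[i]? In other words, what is the right way to implement this kind of "non-blocking fetch" in C++?
The C++ language standard provides no support for this, but some compilers do. For example, GCC provides __builtin_prefetch: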
Built-in Function: void __builtin_prefetch (const void *addr, ...)
This function is used to minimize cache-miss latency by moving data into a
cache before it is accessed. You can insert calls to __builtin_prefetch into
code for which you know addresses of data in memory that is likely to be
accessed soon. If the target supports them, data prefetch instructions are
generated. If the prefetch is done early enough before the access then the
data will be in the cache by the time it is accessed.
The value of addr is the address of the memory to prefetch. There are two
optional arguments, rw and locality. The value of rw is a compile-time
constant one or zero; one means that the prefetch is preparing for a write to
the memory address and zero, the default, means that the prefetch is preparing
for a read. The value locality must be a compile-time constant integer between
zero and three. A value of zero means that the data has no temporal locality,
so it need not be left in the cache after the access. A value of three means
that the data has a high degree of temporal locality and should be left in all
levels of cache possible. Values of one and two mean, respectively, a low or
moderate degree of temporal locality. The default is three.
for (i = 0; i < n; i++)
{
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* … */
}
Data prefetch does not generate faults if addr is invalid, but the address
expression itself must be valid. For example, a prefetch of p->next does not
fault if p->next is not a valid address, but evaluation faults if p is not a
valid address.
If the target does not support data prefetch, the address expression is
evaluated if it includes side effects but no other code is generated and
GCC does not issue a warning.
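Applied to the lookup in the question, a minimal sketch could look like the following. The function name lookup_with_prefetch, the element types, and the assumption that offsets are non-negative and stay in bounds are all illustrative and not part of the original question. The prefetch is only a hint: the hardware may ignore it, and whether it helps depends on the access pattern.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (name and types are assumptions for illustration).
// Assumes i is a valid index into both vectors and that
// i + offset_shifts[i] is also a valid index into data.
inline int lookup_with_prefetch(const std::vector<int>& data,
                                const std::vector<std::size_t>& offset_shifts,
                                std::size_t i)
{
#if defined(__GNUC__) || defined(__clang__)
    // Non-blocking hint: ask for the cache line holding data[i].
    // rw = 0 (read), locality = 3 (keep in all cache levels).
    __builtin_prefetch(&data[i], 0, 3);
#endif
    // This load may miss in the cache; the prefetch above can be
    // in flight while we wait for it.
    i += offset_shifts[i];

    // If the offset was 0 (or small), this line may already have been
    // brought in by the prefetch.
    return data[i];
}
```

On compilers without the builtin, the guard simply drops the hint and the function degrades to the plain two-load version.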
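I'd recommend taking some measurements afterwards to see whether prefetching really helps much; there is no point complicating your code with non-portable compiler features if it does not.
In portable C++, I would solve it as follows: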
result = data[i]; // Unconditional!
auto offset = offset_shifts[i];
if (offset)
result = data[i+offset];
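The rationale is that result is probably just a register, so result = data[i]; is really just a read. This starts the read but does not stall the CPU pipeline for the next operation. offset_shifts[i] is effectively retrieved in parallel with the preceding operation. (The optimizer may even swap the two operations; it knows CPUs better than I do.) If the branch is taken, you get the intended cache effect. If it is not taken, the operation is as efficient as it can be.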
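For completeness, the portable snippet above can be wrapped into a self-contained function roughly like this. The name lookup_portable, the element types, and the assumption of non-negative, in-bounds offsets are illustrative only. Note that nothing in the language forces the compiler to keep the unconditional load first, so the overlap of the two reads is something to verify by measurement rather than assume.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (name and types are assumptions for illustration).
// Assumes i is a valid index into both vectors and that
// i + offset_shifts[i] is also a valid index into data.
inline int lookup_portable(const std::vector<int>& data,
                           const std::vector<std::size_t>& offset_shifts,
                           std::size_t i)
{
    // Unconditional read: neither load depends on the other, so an
    // out-of-order CPU can have both in flight at the same time.
    int result = data[i];
    const std::size_t offset = offset_shifts[i];

    // Only when the index actually moves do we pay for a second
    // read of data.
    if (offset != 0)
        result = data[i + offset];

    return result;
}
```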