Evaluating worst case RAM effective bandwidth with discontinuous memory access

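I am trying to evaluate the effective memory "bandwidth" (throughput of the data being processed) from main memory to CPU in a worst case scenario: the RAM cache is made totally inefficient due to long distances in the successive addresses being treated. As far as I understand what matters here is the RAM latency and not its bandwidth (which is the throughput when transferring large contiguous blocks of data).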

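The scenario is this (say you work with 64 bits=8 bytes values):

you read data at an address
make some light weight CPU computation (so that CPU is not the bottleneck)
then you read data at new address quite far-away from the first one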

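I'd like to have an idea of the throughput (say in bytes). A simple calculation assuming the RAM has typical DDR3 13 ns latency yields a bandwidth of 8 B/13 ns = 600 MB/s. But this raises several points: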

... effective memory "bandwidth" ... from main memory to CPU in a worst case scenario:

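There are two "worst" scenarios: memory accesses that do not use (miss) the CPU caches, and memory accesses that touch addresses too far apart to reuse open DRAM rows.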

the RAM cache

The cache is not part of the RAM; it is part of the CPU and is called the CPU cache (the top part of the memory hierarchy).

is made totally inefficient due to long distances in the successive addresses being treated.

Modern CPU caches have many built-in hardware prefetchers, which may detect non-random steps between several memory accesses. Many prefetchers will detect any constant step inside an aligned 4 kilobyte (KB) page: if you access address1, then address1 + 256 bytes, the L1 prefetcher will start accessing address1 + 256*2, address1 + 256*3, etc. Some prefetchers may try to predict beyond the 4 KB range. So using only long distances between accesses may not be enough. (Prefetchers may be disabled on some processors: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors)
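
To make the 4 KB window concrete, here is a minimal sketch that counts how many fixed-stride accesses stay inside the aligned 4 KB page of the first address. The helper names and the `PAGE_4K_SHIFT` constant are hypothetical, and the code only does address arithmetic; whether a given CPU's prefetcher actually triggers on such a pattern is not something it can tell you.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K_SHIFT 12 /* 4 KB = 2^12 bytes (assumed page size) */

/* True when both addresses fall inside the same aligned 4 KB page. */
static int same_4k_page(uintptr_t a, uintptr_t b)
{
    return (a >> PAGE_4K_SHIFT) == (b >> PAGE_4K_SHIFT);
}

/* Count the accesses base, base + stride, base + 2*stride, ...
 * that stay inside the 4 KB page containing base. */
static unsigned accesses_within_page(uintptr_t base, uintptr_t stride)
{
    assert(stride > 0);
    unsigned count = 0;
    for (uintptr_t addr = base; same_4k_page(base, addr); addr += stride)
        count++;
    return count;
}
```

With a page-aligned start and the 256-byte step from the example above, 16 accesses land in the same page, which gives a stride detector plenty of samples; a step of 4096 bytes or more leaves only one access per page.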

As far as I understand what matters here is the RAM latency and not its bandwidth

Yes, there are some modes in which RAM access is latency-bound.

The scenario is this (say you work with 64 bits=8 bytes values):

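You may work with 8-byte values, but you should consider that memory and caches work with bigger units. Modern DRAM memory has a bus 64 bits (8 bytes) wide (64+8 = 72 bits in the case of ECC), and many transactions may use several bus clock cycles (the burst prefetch in DDR4 SDRAM uses 8n, i.e. 8 * 64 bits). Many transactions between the CPU cache and the memory controller are also bigger, sized as a full cache line or as half of a cache line. A typical cache line is 64 bytes.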
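
As a quick check of those units, assuming the 64-bit bus, a burst of 8 transfers, and a 64-byte cache line, one burst moves exactly one cache line:

```latex
\[
8\ \text{transfers} \times 64\ \text{bits}
  = 512\ \text{bits}
  = 64\ \text{bytes}
  = 1\ \text{cache line}
\]
\[
\frac{8\ \text{bytes used}}{64\ \text{bytes transferred}} = \frac{1}{8}
\]
```

So under these assumptions, reading a single 8-byte value from a line that is not cached still costs a 64-byte transfer, and only one eighth of the moved data is useful.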

you read data at an address
make some light weight CPU computation (so that CPU is not the bottleneck)
then you read data at new address quite far-away from the first one

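This method is not very good for modern out-of-order CPUs. The CPU may speculatively reorder machine instructions and start executing the next memory access before the current memory access completes.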

Classic tests of CPU cache and memory latency (lat_mem_rd from lmbench http://www.bitmover.com/lmbench/lat_mem_rd.8.html and many others) use a memory array filled with some special pseudo-random pattern of pointers; and the test for read latency is like this (https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_mem_rd.c#L95):

char **p = start_pointer;
for (i = 0; i < N; i++) {
    p = (char **)*p;   /* each load depends on the result of the previous one */
    p = (char **)*p;
    /* ... repeated many times to hide loop overhead ... */
    p = (char **)*p;
}

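So the address of the next pointer is stored in memory; the CPU cannot predict the next address and start the next access early, so it will wait for the data to be read from cache or memory.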
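
For illustration, here is one way such a pointer array could be set up. This is a sketch under stated assumptions, not lmbench's actual initialization: `build_chain` and `chase` are hypothetical names, and the random single-cycle ordering uses Sattolo's shuffle as a stand-in for lmbench's own pattern. The slots here are pointer-sized, so several share one cache line; a real measurement would space the slots at least one cache line apart and use an array much larger than the last-level cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Link n slots into one random cycle, so that following the pointers
 * visits every slot exactly once before returning to the start.
 * Uses Sattolo's shuffle; rand() keeps the sketch short but is a weak
 * source of randomness for very large n. */
static void build_chain(char **slots, size_t n, unsigned seed)
{
    assert(n > 1);
    size_t *perm = malloc(n * sizeof *perm);
    assert(perm != NULL);
    for (size_t i = 0; i < n; i++)
        perm[i] = i;

    srand(seed);
    /* Sattolo: swap element i with a strictly earlier element j < i. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }

    /* Slot i stores the address of slot perm[i]. */
    for (size_t i = 0; i < n; i++)
        slots[i] = (char *)&slots[perm[i]];
    free(perm);
}

/* Follow the chain for `steps` dependent loads. Returning the final
 * position keeps the compiler from discarding the loop. */
static char **chase(char **start, size_t steps)
{
    char **p = start;
    for (size_t i = 0; i < steps; i++)
        p = (char **)*p;
    return p;
}
```

Timing a call such as `chase(slots, steps)` and dividing the elapsed time by `steps` gives an estimate of the average latency per dependent load, provided the working set does not fit in cache.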

I'd like to have an idea of the throughput (say in bytes).

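It can be measured in accesses per second; for byte accesses, word accesses, or 8-byte accesses there will be a similar number of accesses/s, and the throughput (bytes/s) is that number multiplied by the size of the unit used.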

Sometimes a similar value is measured: GUPS, giga-updates per second (data in memory is read, updated and written back), with the Random Access test. This test can use the memory of a computing cluster of hundreds (or tens of thousands) of PCs; check the GUP/s column in http://icl.cs.utk.edu/hpcc/hpcc_results.cgi?display=combo

A simple calculation assuming the RAM has typical DDR3 13 ns latency yields a bandwidth of 8 B/ 13 ns = 600 MB/s. But this raises several points:

RAM has several latencies (timings): https://en.wikipedia.org/wiki/Memory_timings

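And the 13 ns CAS latency is relevant only when you access an already open row. For random accesses you will often hit closed rows, and the T_RCD latency is added to CAS.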
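
As a rough worked example of how much this changes the estimate, the sketch below computes throughput for fully serialized accesses. The function name is hypothetical, and the 13 ns T_RCD used in the usage note is an assumption for illustration only (actual timings depend on the module). It also ignores precharge time, the memory controller, and cache-miss handling, so latencies measured from the CPU are typically higher.

```c
#include <assert.h>

/* Throughput in MB/s (decimal) when every access of `unit_bytes` waits
 * the full `latency_ns` and no two accesses overlap. */
static double serialized_bandwidth_mb_s(double unit_bytes, double latency_ns)
{
    assert(latency_ns > 0.0);
    /* bytes per nanosecond equals GB/s; multiply by 1000 for MB/s. */
    return unit_bytes / latency_ns * 1000.0;
}
```

With 8-byte values, `serialized_bandwidth_mb_s(8.0, 13.0)` gives about 615 MB/s for CAS alone, while `serialized_bandwidth_mb_s(8.0, 13.0 + 13.0)` gives about 308 MB/s once the assumed T_RCD is added.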