英特尔 CPU 上的原子添加操作和缓存行锁定的 RFO 计数？

Question

我想了解原子添加操作的本质。因此，我正在运行Broadwell 机器中编写以下代码。

int main(int argc, char ** argv){
    int nThreads = -1;
    float shareFrac = -1;
    uint64_t nIter = -1;

    ParseArg(argc, argv, nThreads, shareFrac, nIter);

    atomic<uint64_t> justToAvoidCompilerOptimization;

    #pragma omp parallel num_threads(nThreads)
    {
        int me = omp_get_thread_num();
        atomic<uint64_t> *tsData = &trueSharingData.data[0];
        atomic<uint64_t> *privateData = &(new SharedData_t())->data[0];
        for(uint64_t i = 0 ; i < nIter; i++) {
            // Use RDTSC as a proxy random number generator
            unsigned long lo, hi;
                asm volatile( "rdtsc" : "=a" (lo), "=d" (hi) ); 
                int rNum  = (lo % 54121) % 100; // mod by a prime.
            // if the random number is < shareFrac, perform a shared memory operation
            if (rNum < shareFrac) {
                *tsData += rNum2;
            } else {
                *privateData += rNum;
            }
        }       
        justToAvoidCompilerOptimization += *tsData;     
        justToAvoidCompilerOptimization += *privateData;        
    }


    return justToAvoidCompilerOptimization.load() ^ justToAvoidCompilerOptimization.load();
}

在这段代码中，基本上每个线程都执行原子添加操作 nIter 次，其中 nIter 是循环次数。在每次循环迭代中，原子添加操作可能在共享内存位置或线程局部变量上执行。

在共享内存位置执行原子添加操作所花费的循环次数的分数由参数 shareFrac 确定。例如，如果 shareFrac 为 0.3，而 nIter 为 1000，则预计在共享内存位置执行原子添加大约 300 次。

所以，我做了一个小实验，我运行这个简单的代码随着 shareFrac 值的增加多次。对于每个运行，我使用 perf 计算了 L2_RQSTS.RFO_MISS 事件的发生次数。我还将 perf 给出的计数与预期计数进行比较。预期计数只是 nthreads * nIter * shareFrac.

结果如下。

nThreads = 2, nIter = 1亿

nThreads = 8, nIter = 亿

从图中可以看出，在大多数运行中，RFO 未命中计数超过了预期计数。这怎么可能？？一个可能的解释是，原子添加带来了一条带有 RFO 的行，希望读取然后更新。然而，线路可能在读取和写入之间被盗，在这种情况下，必须将线路取回。但是，据我所知，对于 x86 上的原子操作，cacheline 是锁定的，因此，cacheline 一旦被赋予独占权限就不能被窃取。还是我的理解不正确？

为了消除由于预取而导致缓存行 t运行sfer 的可能性，我还在获得这些结果之前消除了机器所有内核上的 h/w 预取器。

Answer 1

我认为当前的 Intel 总是无条件地锁定高速缓存行以进行原子操作，因此 L2 未命中数应该根据访问次数准确预测的假设可能不准确。

比如this Intel patent的背景描述了锁指令的"conventional"机制，即直接执行指令的lock/load和unlock/store部分背靠背和退休时，这样相关的线路可以很容易地在整个时间保持锁定状态。我认为，这大致符合您描述它的工作方式，如果它仅以这种方式工作，您可能希望 L2 RFO 未命中遵循预期线。

但是，该专利本身描述了一种放宽锁定要求的机制。特别是，尽早执行操作的 load/lock 部分，基本上作为普通加载，并推测关联的缓存在加载执行和存储提交之间的时间不会 "stolen" 。如果确实发生了这种被盗缓存行，则需要重放该操作。用英特尔在专利中的话来说：

However, if the prediction is that the particular lock instruction will in fact not be contended, then it may be possible to proceed with a speculatively-issued normal load micro-operation and monitor the concerned memory location with the monitor logic 116 to determine whether any contended indications arise. Thus, we may not actually lock the memory location while performing the read-modify-write parts of the instruction to enforce atomicity, but instead perform the parts separately while watching for conditions that would indicate that another processor or thread may have broken the perception of atomicity. Such contended indications may include a snoop to the cache line that includes the target address of the load instruction, an interrupt, or if the subsequent store_unlock micro-operation misses in a cache.

The monitor logic 116 may in some embodiments monitor several existing logic signals present within the processor. If no contended indications arise during the period of time representing an equivalent locked condition, then the speculatively-issued normal load micro-operation may retire normally. This may permit out-of-order execution of the lock instruction and enhance processor performance. However, if contended indications do arise, the pipeline may have to be flushed and the lock instruction re-executed.

这只是一小段摘录，但抓住了相关的想法：尝试以与乱序执行更兼容的方式执行锁定，如果失败，请重试采用更保守的方法。该专利继续解释预测器如何工作，并与分支预测进行类比。基本方法是简单地跟踪每个 IP 的争用行为。

这可以解释为什么额外的 RFO 事件在接近 100% 的 shareFrac 时变为零：在这一点上，线路竞争激烈以至于 heuristic/predictor 会尝试更积极的锁定实现没有被触发，所以总是走保守的路径。

您或许可以通过检测是否存在乱序执行的测试来证实这一理论，并表明当 RFO 请求的数量增加时，也会发生一些 OoO 执行。

英特尔 CPU 上的原子添加操作和缓存行锁定的 RFO 计数？

RFO counts for Atomic Add Operations and Cacheline Locking on Intel CPUs?

x86

atomic

cpu-architecture

shared-memory

perf