使用 pre-fetching 优化对数组的线性访问并在 C 中缓存

Question

披露：我在 programmers.stack 上尝试过类似的问题，但那个地方离 activity 堆栈很远。

简介

我倾向于处理大量大图像。它们也以不止一个的顺序出现，必须反复处理和播放。有时我使用 GPU，有时 CPU，有时两者都用。大多数访问模式本质上是线性的（来回），这让我开始思考关于数组的更基本的事情，以及一种方法应该如何编写针对给定硬件上可能的最大内存带宽优化的代码（允许计算不会阻塞 read/write).

测试规格

我在配备 4GB RAM 和 SSD 的 2011 MacbookAir4,2 (I5-2557M) 上完成了此操作。除了 iterm2.
gcc 5.2.0 (homebrew) with flags: -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 with additional include and library flags as well as framework flags 为了使用我倾向于使用的 glfw 定时器。没有我也能做到，没关系。当然是全部 64 位。
我已经尝试使用可选的 -fprefetch-loop-arrays 标志进行测试，但它似乎根本不会影响结果

测试

在堆上分配两个 n bytes 的数组 - 其中 n 是 8, 16, 32, 64, 128, 256, 512 and 1024 MB
将array初始化为0xff，一次一个字节
测试 1 - 线性复制

线性副本：

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }

测试 2 - 大步复制。这就是令人困惑的地方。我试过在这里玩 pre-fetch 游戏。我已经尝试了每个循环我应该做多少的各种组合，似乎每个循环 ~40 会产生最佳性能。 为什么？我不知道。我知道 malloc in c99 with uint64_t 会给我内存对齐的块。我还看到我的 L1 到 L3 缓存的大小，比这些 320 bytes 大，所以我在打什么？线索可能会在后面的图表中出现。我很想了解这个。

步幅复制：

for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }

测试 3 - 大步阅读。与大步复制相同。

步幅阅读：

    const int imax = 1000;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }

测试 4 - 线性读数。一个字节一个字节。我很惊讶 -fprefetch-loop-arrays 在这里没有产生任何结果。我以为是针对这些情况的。

线性读数：

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }

测试 5 - memcpy 作为对比。

memcpy:

memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));

结果

示例输出：

示例输出：

Init done in 0.767 s - size of array: 1024 MBs (x2)
Performance: 1304.325 MB/s

Copying (linear) done in 0.898 s
Performance: 1113.529 MB/s

Copying (stride 40) done in 0.257 s
Performance: 3890.608 MB/s

[1000/1000] Performance stride 40: 7474.322 MB/s
Average: 7523.427 MB/s
Performance MIN: 3231 MB/s | Performance MAX: 7818 MB/s

[1000/1000] Performance dumb: 2504.713 MB/s
Average: 2481.502 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2644 MB/s

Copying (memcpy) done in 1.726 s
Performance: 579.485 MB/s

--

Init done in 0.415 s - size of array: 512 MBs (x2)
Performance: 1233.136 MB/s

Copying (linear) done in 0.442 s
Performance: 1157.147 MB/s

Copying (stride 40) done in 0.116 s
Performance: 4399.606 MB/s

[1000/1000] Performance stride 40: 6527.004 MB/s
Average: 7166.458 MB/s
Performance MIN: 4359 MB/s | Performance MAX: 7787 MB/s

[1000/1000] Performance dumb: 2383.292 MB/s
Average: 2409.005 MB/s
Performance MIN: 1673 MB/s | Performance MAX: 2641 MB/s

Copying (memcpy) done in 0.102 s
Performance: 5026.476 MB/s

--

Init done in 0.228 s - size of array: 256 MBs (x2)
Performance: 1124.618 MB/s

Copying (linear) done in 0.242 s
Performance: 1057.916 MB/s

Copying (stride 40) done in 0.070 s
Performance: 3650.996 MB/s

[1000/1000] Performance stride 40: 7129.206 MB/s
Average: 7370.537 MB/s
Performance MIN: 4805 MB/s | Performance MAX: 7848 MB/s

[1000/1000] Performance dumb: 2456.129 MB/s
Average: 2435.556 MB/s
Performance MIN: 1496 MB/s | Performance MAX: 2637 MB/s

Copying (memcpy) done in 0.050 s
Performance: 5095.845 MB/s

-- 

Init done in 0.100 s - size of array: 128 MBs (x2)
Performance: 1277.200 MB/s

Copying (linear) done in 0.112 s
Performance: 1147.030 MB/s

Copying (stride 40) done in 0.029 s
Performance: 4424.513 MB/s

[1000/1000] Performance stride 40: 6497.635 MB/s
Average: 6714.540 MB/s
Performance MIN: 4206 MB/s | Performance MAX: 7843 MB/s

[1000/1000] Performance dumb: 2275.336 MB/s
Average: 2335.544 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2626 MB/s

Copying (memcpy) done in 0.025 s
Performance: 5086.502 MB/s

-- 

Init done in 0.051 s - size of array: 64 MBs (x2)
Performance: 1255.969 MB/s

Copying (linear) done in 0.058 s
Performance: 1104.282 MB/s

Copying (stride 40) done in 0.015 s
Performance: 4305.765 MB/s

[1000/1000] Performance stride 40: 7750.063 MB/s
Average: 7412.167 MB/s
Performance MIN: 3892 MB/s | Performance MAX: 7826 MB/s

[1000/1000] Performance dumb: 2610.136 MB/s
Average: 2577.313 MB/s
Performance MIN: 2126 MB/s | Performance MAX: 2652 MB/s

Copying (memcpy) done in 0.013 s
Performance: 4871.823 MB/s

-- 

Init done in 0.024 s - size of array: 32 MBs (x2)
Performance: 1306.738 MB/s

Copying (linear) done in 0.028 s
Performance: 1148.582 MB/s

Copying (stride 40) done in 0.008 s
Performance: 4265.907 MB/s

[1000/1000] Performance stride 40: 6181.040 MB/s
Average: 7124.592 MB/s
Performance MIN: 3480 MB/s | Performance MAX: 7777 MB/s

[1000/1000] Performance dumb: 2508.669 MB/s
Average: 2556.529 MB/s
Performance MIN: 1966 MB/s | Performance MAX: 2646 MB/s

Copying (memcpy) done in 0.007 s
Performance: 4617.860 MB/s

--

Init done in 0.013 s - size of array: 16 MBs (x2)
Performance: 1243.011 MB/s

Copying (linear) done in 0.014 s
Performance: 1139.362 MB/s

Copying (stride 40) done in 0.004 s
Performance: 4181.548 MB/s

[1000/1000] Performance stride 40: 6317.129 MB/s
Average: 7358.539 MB/s
Performance MIN: 5250 MB/s | Performance MAX: 7816 MB/s

[1000/1000] Performance dumb: 2529.707 MB/s
Average: 2525.783 MB/s
Performance MIN: 1823 MB/s | Performance MAX: 2634 MB/s

Copying (memcpy) done in 0.003 s
Performance: 5167.561 MB/s

--

Init done in 0.007 s - size of array: 8 MBs (x2)
Performance: 1186.019 MB/s

Copying (linear) done in 0.007 s
Performance: 1147.018 MB/s

Copying (stride 40) done in 0.002 s
Performance: 4157.658 MB/s

[1000/1000] Performance stride 40: 6958.839 MB/s
Average: 7097.742 MB/s
Performance MIN: 4278 MB/s | Performance MAX: 7499 MB/s

[1000/1000] Performance dumb: 2585.366 MB/s
Average: 2537.896 MB/s
Performance MIN: 2284 MB/s | Performance MAX: 2610 MB/s

Copying (memcpy) done in 0.002 s
Performance: 5059.164 MB/s

线性阅读比跨步阅读慢3倍。步幅读数最大约。 7500-7800 MB/s 范围。不过有两件事让我感到困惑。在 DDR3 1333 Mhz，最大内存吞吐量应该是 10,664 MB/s 那么为什么我没有达到它？为什么读取速度不一致，我将如何优化它（缓存未命中？）？从图表中可以更明显地看出这一点，尤其是性能有规律下降的线性读数。

图表

8-16MB

32-64MB

128-256 MB

512-1024 MB

一起

这里有任何感兴趣的人的完整资源：

/*
gcc -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 -I "...path to glfw3 includes ..." -L "...path to glfw3 lib ..." arr_test_copy_gnuplot.c -o arr_test_copy_gnuplot -lglfw3 -framework OpenGL -framework Cocoa -framework IOKit -framework CoreVideo

optional: -fprefetch-loop-arrays
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM 1000000 * 128 /* GIG */
int main(int argc, char *argv[]) {

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    int cx = 0;
    char filename_stride[50];
    char filename_dumb[50];
    cx = snprintf(filename_stride, 50, "%lu_stride.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
    FILE *file_stride = fopen(filename_stride, "w");
    cx = snprintf(filename_dumb, 50, "%lu_dumb.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
    FILE *file_dumb   = fopen(filename_dumb, "w");
    if(file_stride == NULL || file_dumb == NULL) {
        perror("Error opening file.");
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);

    double performance  = 0.0;
    double time_start   = 0.0;
    double time_end     = 0.0;
    double performance_min  = 0.0;
    double performance_max  = 0.0;

    /* Init array */
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Init done in %.3f s - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Linear copy */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (linear) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Copying with wide stride */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (stride 40) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Reading with wide stride */
    const int imax = 1000;
    double performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }

        printf("[%d/%d] Performance stride 40: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_stride, "%d\t%f\n", j, performance);
        fflush(file_stride);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Linear reading */
    performance_average = 0.0;
    performance_min     = 0.0;
    performance_max      = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }
        printf("[%d/%d] Performance dumb: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_dumb, "%d\t%f\n", j, performance);
        fflush(file_dumb);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Memcpy */
    performance = 0;
    time_start = glfwGetTime();
    memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (memcpy) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n", performance);

    /* Cleanup and exit */
    free(array);
    free(array_copy);
    glfwTerminate();
    fclose(file_dumb);
    fclose(file_stride);

    exit(EXIT_SUCCESS);
}

总结

在使用线性访问是最常见模式的数组时，我应该如何编写代码以获得最大和（接近）恒定的速度？
我可以从这个例子中学到什么关于缓存和 pre-fetch 的知识？
这些图表是否告诉我一些我应该知道但我没有注意到的事情？
我还能如何展开循环？我试过 -funroll-loops 没有结果，所以我求助于手动编写 loop-in-loop unrolls.

感谢您的阅读。

编辑：

似乎 -O0 给出了与 -O 标志不存在时不同的性能！是什么赋予了？没有标志会产生更好的性能，如图所示。

编辑2：

我终于达到了 AVX 的极限。

=== READING WITH AVX ===
[1000/1000] Performance AVX: 9868.912 MB/s
Average: 10029.085 MB/s
Performance MIN: 6554 MB/s | Performance MAX: 11464 MB/s

平均值非常接近 10664。我不得不将编译器更改为 clang，因为 gcc 让我很难使用 avx (-mavx)。这也是为什么图形有更明显的下降的原因。我仍然想知道如何/什么是/有持续的表现。我认为这是由于缓存/缓存行造成的。它还可以解释性能在不同地方超过 DDR3 速度（MAX 为 11464 MB/s）。

请原谅我的 gnuplot-fu 和它的钥匙。蓝色是 SSE2 ( _mm_load_si128 )，橙色是 AVX ( _mm256_load_si256 )。紫色像以前一样大步向前，绿色是哑巴，一次读一个。

所以，最后两个问题是：

什么导致了下降以及如何获得更稳定的性能
是否可以在没有内在函数的情况下达到上限？

要点最新版本：https://gist.github.com/Keyframe/1ed9062ec52fc4a0d14b 和那个版本的图表：http://imgur.com/a/cPeor

Answer 1

对于您正在做的事情，我会查看 SIMD（单指令多数据），google 了解 GCC 编译器内部函数的详细信息

Answer 2

您的主内存峰值带宽值偏离了两倍。而不是 10664 MB/s 它 should be 21.3 GB/s （更准确地说它应该是 (21333⅓) MB/s - 请参阅下面的推导）。事实上，您有时看到超过 10664 MB/s 应该告诉您，您的峰值带宽计算可能存在问题。

为了获得最大带宽for Core2 through Sandy Bridge you need to use non-temporal stores. Additionally, you need multiple threads。您不需要 AVX 指令或展开循环。

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
    }
}

数组需要按 16 字节对齐并且是 16 的倍数。非临时存储的经验法则是当您复制的内存大于最后一级缓存大小的一半时使用它们.在您的情况下，L3 缓存大小的一半是 1.5 MB，您复制的最小数组是 8 MB，因此这比最后一级缓存大小的一半大得多。

下面是一些测试代码。

//gcc -O3 -fopenmp foo.c
#include <stdio.h>
#include <x86intrin.h>
#include <string.h>
#include <omp.h>

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

void copy2(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

int main(void)
{
    unsigned n = 0x7fffffff;
    char *x = _mm_malloc(n, 16);
    char *y = _mm_malloc(n, 16);
    double dtime;

    memset(x,0,n);
    memset(y,1,n);

    dtime = -omp_get_wtime();
    copy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    copy2(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    memcpy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);  
}

在我的系统上，Core2（Nehalem 之前）P9600@2.53GHz，它提供

time non temporal store 0.39
time SSE store          1.10
time memcpy             0.98

复制2GB。

请注意，"touch" 首先要写入的内存非常重要（我使用 memset 来执行此操作）。在您访问它之前，您的系统不一定会分配您的内存。如果在执行内存复制时尚未访问内存，则执行此操作的开销可能会严重影响您的结果。

According to wikipedia DDR3-1333 内存频率为 166⅔ MHz。 DDR 以两倍内存时钟速率传输数据。此外，DDR3 的总线时钟倍频器为 4。因此 DDR3 的每个内存时钟的总乘法为 8。此外，您的主板有两个内存通道。所以总的传输速率是

 21333⅓ MB/s = (166⅔ 1E6 clocks/s) * (8 lines/clock/channel) * (2 channels) * (64-bits/line) * (byte/8-bits) * (MB/1E6 bytes).

Answer 3

您应该使用最近的 GCC 进行编译（因此在 2015 年 11 月编译您的 GCC 5.2 是个好主意），并且您应该为您的特定平台启用优化，因此我建议使用 [=至少 10=]（也尝试用 -O3 替换 -O2）。

^{（不要在编译器中启用优化的情况下对程序进行基准测试）}

如果您担心缓存效果，您可以使用 __builtin_prefetch，但请参阅 this。

另请参阅 OpenMP, OpenCL, OpenACC。

使用 pre-fetching 优化对数组的线性访问并在 C 中缓存

Optimizing linear access to arrays with pre-fetching and cache in C

c

memory

performance

c99