Cache hits on matrix multiplication
https://youtu.be/o7h_sYMk_oc?t=1963
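In this video, he explains that fetching data that is far apart leads to worse cache line utilization, and then he says a line I don't understand:
"So the processor is going to bring in 64 bytes to operate on a particular datum. And then it ignores 7 of the 8 floating-point words on that cache line and goes to the next one."
What does he mean by this?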
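Caches are normally organized in cache lines. When data is read into the cache, a complete cache line is read at once. So if a cache line holds 64 bytes, the processor hardware makes sure that 64 consecutive bytes are read from memory into the cache. If a double-precision floating-point number is 8 bytes, a single cache line can hold 8 doubles.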
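As a rough illustration of the mapping described above, the following Python sketch computes which cache line each of ten consecutive doubles falls into. The 64-byte line size and 8-byte double come from the text; the start address `base` and the helper name `cache_line_index` are made up for the example, and real hardware does this in silicon rather than in software.

```python
LINE_SIZE = 64    # bytes per cache line (as assumed in the text)
DOUBLE_SIZE = 8   # bytes per double

def cache_line_index(byte_addr):
    """Return the index of the cache line that contains byte_addr."""
    return byte_addr // LINE_SIZE

# Hypothetical start address, chosen to be aligned to a cache line.
base = 4096

# Cache line index for each of 10 consecutive doubles starting at base.
lines = [cache_line_index(base + i * DOUBLE_SIZE) for i in range(10)]
print(lines)
```

The first eight doubles share one line index and the ninth starts the next line, which is why a single fetch can serve up to eight consecutive accesses.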
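Now, if your code uses consecutive doubles, the cache accesses will look like this (Addr+n here means the n-th double after Addr, and Addr is assumed to sit at the start of a cache line):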
Access double located in Addr --> Miss, 64 bytes read into the cache (slow)
Access double located in Addr+1 --> Hit (fast)
Access double located in Addr+2 --> Hit (fast)
Access double located in Addr+3 --> Hit (fast)
Access double located in Addr+4 --> Hit (fast)
Access double located in Addr+5 --> Hit (fast)
Access double located in Addr+6 --> Hit (fast)
Access double located in Addr+7 --> Hit (fast)
Access double located in Addr+8 --> Miss, 64 bytes read into the cache (slow)
Access double located in Addr+9 --> Hit (fast)
Access double located in Addr+10 --> Hit (fast)
Access double located in Addr+11 --> Hit (fast)
Access double located in Addr+12 --> Hit (fast)
Access double located in Addr+13 --> Hit (fast)
Access double located in Addr+14 --> Hit (fast)
Access double located in Addr+15 --> Hit (fast)
Access double located in Addr+16 --> Miss, 64 bytes read into the cache (slow)
Access double located in Addr+17 --> Hit (fast)
...
So here you get 1 slow read followed by 7 fast reads, over and over, because your program uses consecutive doubles.
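The sequential pattern can be reproduced with a toy model. This is only a sketch under simplifying assumptions: the cache never evicts anything, there is no hardware prefetcher, and the array starts on a cache line boundary. The function name `trace` is hypothetical.

```python
LINE_SIZE = 64    # bytes per cache line
DOUBLE_SIZE = 8   # bytes per double

def trace(indices, base=0):
    """Label each access 'Miss' or 'Hit', assuming lines are never evicted."""
    cached = set()   # indices of the cache lines already fetched
    labels = []
    for i in indices:
        line = (base + i * DOUBLE_SIZE) // LINE_SIZE
        if line in cached:
            labels.append("Hit")
        else:
            cached.add(line)   # a miss brings the whole line into the cache
            labels.append("Miss")
    return labels

# Access doubles Addr+0 through Addr+17, one after another.
labels = trace(range(18))
for i, label in enumerate(labels):
    print(f"Addr+{i}: {label}")
print(labels.count("Miss"), "misses out of", len(labels), "accesses")
```

In this model only every eighth access is a miss, matching the listing above.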
However, if your program always uses doubles that are 8 doubles (i.e. 64 bytes) apart, the pattern will be:
Access double located in Addr --> Miss, 64 bytes read into the cache (slow)
Access double located in Addr+8 --> Miss, 64 bytes read into the cache (slow)
Access double located in Addr+16 --> Miss, 64 bytes read into the cache (slow)
...
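Here you get only slow reads and no benefit from the cache. Every access pulls in a full 64-byte line, but only one 8-byte double from it is used; the other 7 are ignored, which is what the speaker is describing.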
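To put a number on "cache line utilization", the sketch below counts how many distinct lines a strided access pattern touches and compares the bytes actually used with the bytes fetched. It assumes each line is fetched exactly once and that the array starts on a line boundary; `line_utilization` is a name invented for this example.

```python
LINE_SIZE = 64    # bytes per cache line
DOUBLE_SIZE = 8   # bytes per double

def line_utilization(stride, count):
    """Fraction of fetched bytes that are used when reading `count` doubles
    that are `stride` doubles apart, starting at a line-aligned address."""
    touched = {(i * stride * DOUBLE_SIZE) // LINE_SIZE for i in range(count)}
    bytes_fetched = len(touched) * LINE_SIZE
    bytes_used = count * DOUBLE_SIZE
    return bytes_used / bytes_fetched

for stride in (1, 2, 4, 8):
    print(f"stride {stride}: utilization {line_utilization(stride, 64):.3f}")
```

With a stride of 1 every fetched byte is used, while with a stride of 8 only one eighth of each line is, i.e. 7 of the 8 doubles on every line are brought in and then ignored.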