On the average modern x64 CPU, is cmpxchg16b much slower than its 64-bit or 32-bit counterparts?
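I believe Windows has been using this instruction internally for a long time, so presumably CPU manufacturers have put effort into optimizing it?
This assumes, of course, properly aligned memory, no cache-line sharing, and so on.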
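The two preconditions named in the question can be checked mechanically. Below is a minimal sketch, not taken from the original post: the helper names `is_aligned_16`, `within_one_cache_line`, `check_slot` and the `TaggedPtr` struct are hypothetical, and the 64-byte cache line size is an assumption that holds for the x86-64 CPUs discussed here. cmpxchg16b requires its memory operand to be 16-byte aligned, and a 16-byte aligned 16-byte object can never straddle a 64-byte line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` is 16-byte aligned, which cmpxchg16b
// requires of its memory operand.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: true if the `size` bytes starting at `p` lie inside a
// single cache line. The default of 64 bytes is an assumption.
inline bool within_one_cache_line(const void* p, std::size_t size,
                                  std::size_t line = 64) {
    const auto a = reinterpret_cast<std::uintptr_t>(p);
    return a / line == (a + size - 1) / line;
}

// Illustrative 16-byte payload (pointer plus version tag), the kind of object
// a 16-byte compare-and-swap is typically applied to.
struct alignas(16) TaggedPtr {
    void* ptr;
    std::uint64_t tag;
};
static_assert(sizeof(TaggedPtr) == 16, "payload must be exactly 16 bytes");
static_assert(alignof(TaggedPtr) == 16, "payload must be 16-byte aligned");

// Debug-time check to run before handing a slot to a 16-byte CAS.
inline void check_slot(const TaggedPtr* slot) {
    assert(is_aligned_16(slot));
    assert(within_one_cache_line(slot, sizeof *slot));
}
```

Calling `check_slot(&slot)` on any `TaggedPtr` object should pass, because `alignas(16)` already guarantees both properties. Avoiding false sharing between threads is a separate matter and would need `alignas(64)` or padding.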
Out of curiosity, I wrote a small benchmark to compare the cost of 4-byte and 8-byte cmpxchg with that of cmpxchg16b:
#include <cstdint>
#include <benchmark/benchmark.h>

// 16 KiB of zeroed, 16-byte aligned storage; it fits in the 32 KiB L1 data cache.
alignas(16) char input[16 * 1024] = {};

template<class T>
void do_benchmark(benchmark::State& state) {
    unsigned n = 0;
    T* p = reinterpret_cast<T*>(input);
    constexpr unsigned count = sizeof input / sizeof(T);
    unsigned i = 0;
    for(auto _ : state) {
        T v{0};
        // Compare-and-swap 0 with 0 on successive elements, wrapping around the array.
        n += __sync_bool_compare_and_swap(p + i++ % count, v, v);
    }
    benchmark::DoNotOptimize(n);
}

BENCHMARK_TEMPLATE(do_benchmark, std::int32_t);
BENCHMARK_TEMPLATE(do_benchmark, std::int64_t);
BENCHMARK_TEMPLATE(do_benchmark, __int128); // built with -mcx16, see the command lines below
BENCHMARK_MAIN();
I ran it on a Coffee Lake i9-9900KS CPU. Results with gcc-8.3.0:
$ make -rC ~/src/test -j8 BUILD=release run_cmpxchg16b_benchmark
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-{maybe-uninitialized,unused-function}} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-{functions,loops}=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
g++ -o /home/max/src/test/release/gcc/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/gcc/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/gcc/cmpxchg16b_benchmark
2020-03-15 20:18:48
Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.43, 0.40, 0.34
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.53 ns         3.53 ns    198281069
do_benchmark<std::int64_t>       3.53 ns         3.53 ns    198256710
do_benchmark<__int128>           6.35 ns         6.35 ns    110215116
Results with clang-8.0.0:
$ make -rC ~/src/test -j8 BUILD=release TOOLSET=clang run_cmpxchg16b_benchmark
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -c -pthread -march=native -std=gnu++17 -W{all,extra,error,no-unused-function} -g -fmessage-length=0 -O3 -mtune=native -ffast-math -falign-functions=64 -DNDEBUG -mcx16 -MD -MP /home/max/src/test/cmpxchg16b_benchmark.cc
clang++ -o /home/max/src/test/release/clang/cmpxchg16b_benchmark -fuse-ld=gold -pthread -g /home/max/src/test/release/clang/cmpxchg16b_benchmark.o -lrt /usr/local/lib/libbenchmark.a
sudo cpupower frequency-set --related --governor performance >/dev/null
/home/max/src/test/release/clang/cmpxchg16b_benchmark
2020-03-15 20:19:00
Running /home/max/src/test/release/clang/cmpxchg16b_benchmark
Run on (16 X 5100 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.36, 0.39, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       3.84 ns         3.84 ns    182461520
do_benchmark<std::int64_t>       3.84 ns         3.84 ns    182160259
do_benchmark<__int128>           5.99 ns         5.99 ns    116972653
It looks like cmpxchg16b is about 1.6-1.8 times more expensive than 8-byte cmpxchg on Intel Coffee Lake.
The same benchmark on a Ryzen 9 5950X with gcc-9.3.0:
Running /home/max/src/test/release/gcc/cmpxchg16b_benchmark
Run on (32 X 4889.51 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 512 KiB (x16)
L3 Unified 32768 KiB (x2)
Load Average: 1.11, 0.52, 0.33
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
do_benchmark<std::int32_t>       1.58 ns         1.58 ns    436624535
do_benchmark<std::int64_t>       1.58 ns         1.58 ns    443977862
do_benchmark<__int128>           2.22 ns         2.22 ns    316143309
On the AMD Ryzen 9, cmpxchg16b is about 1.4 times more expensive than 8-byte cmpxchg.
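For reference, the quoted ratios can be reproduced from the per-iteration times in the tables. The sketch below only redoes that arithmetic; the `Measurement` struct, the `kMeasurements` table and the `slowdown` function are illustrative names, not part of the benchmark itself, and the numbers are copied from the outputs above.

```cpp
#include <cassert>
#include <cmath>

// One row per benchmark run: time per iteration (ns) for the 8-byte CAS and
// for the 16-byte CAS, copied from the benchmark outputs.
struct Measurement {
    const char* label;
    double ns_8_byte;
    double ns_16_byte;
};

constexpr Measurement kMeasurements[] = {
    {"Coffee Lake i9-9900KS, gcc-8.3.0",   3.53, 6.35},
    {"Coffee Lake i9-9900KS, clang-8.0.0", 3.84, 5.99},
    {"Ryzen 9 5950X, gcc-9.3.0",           1.58, 2.22},
};

// How many times slower the 16-byte CAS is than the 8-byte CAS.
constexpr double slowdown(const Measurement& m) {
    return m.ns_16_byte / m.ns_8_byte;
}
```

Evaluating `slowdown` for the three rows gives roughly 1.80, 1.56 and 1.41, which is where the "1.6-1.8 times" and "1.4 times" figures come from. These are single-threaded, uncontended, L1-resident numbers, so they say nothing about behavior under contention.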