AVX2 SIMD performs relatively worse than scalar at higher optimization levels
I'm learning to use SIMD intrinsics and wrote a simple program that compares how many vector addition instructions it can run in 1 second against plain scalar addition. I found that SIMD performs relatively better at lower optimization levels and consistently much worse at higher ones, and I want to know why. I used both MSVC and gcc, and it's the same story. The results below are from a Ryzen 7 CPU; I also tested on an Intel platform and it was pretty much the same.
#include <iostream>
#include <numeric>
#include <chrono>
#include <iterator>
#include <thread>
#include <atomic>
#include <vector>
#include <immintrin.h>
int main()
{
    const auto threadLimit = std::thread::hardware_concurrency() - 1; //for running main()
    for (auto i = 1; i <= threadLimit; ++i)
    {
        std::cerr << "Testing " << i << " threads: ";
        std::atomic<unsigned long long> sumScalar {};
        std::atomic<unsigned long long> loopScalar {};
        std::atomic<unsigned long long> sumSimd {};
        std::atomic<unsigned long long> loopSimd {};
        std::atomic_bool stopFlag{ false };
        std::vector<std::thread> threads;
        threads.reserve(i);
        {
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                {
                    uint32_t local{};
                    uint32_t loop{};
                    while (!stopFlag)
                    {
                        ++local;
                        ++loop; //removed this(see EDIT)
                    }
                    sumScalar += local;
                    loopScalar += loop;
                });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        threads.clear();
        stopFlag = false;
        {
            for (auto j = 0; j < i; ++j)
                threads.emplace_back([&]
                {
                    const auto oneVec = _mm256_set1_epi32(1);
                    auto local = _mm256_set1_epi32(0);
                    uint32_t inc{};
                    while (!stopFlag)
                    {
                        local = _mm256_add_epi32(oneVec, local);
                        ++inc; //removed this(see EDIT)
                    }
                    sumSimd += std::accumulate(reinterpret_cast<uint32_t*>(&local), reinterpret_cast<uint32_t*>(&local) + 8, uint64_t{});
                    loopSimd += inc;
                });
            std::this_thread::sleep_for(std::chrono::seconds{ 1 });
            stopFlag = true;
            for (auto& thread : threads)
                thread.join();
        }
        std::cout << "Sum: "<<sumSimd <<" / "<<sumScalar <<"("<<100.0*sumSimd/sumScalar<<"%)\t"<<"Loop: "<<loopSimd<<" / "<<loopScalar<<"("<< 100.0*loopSimd/loopScalar<<"%)\n";
        // SIMD/Scalar, higher value means SIMD better
    }
}
With g++ -O0 -march=native -lpthread, I get:
Testing 1 threads: Sum: 1004405568 / 174344207(576.105%) Loop: 125550696 / 174344207(72.0131%)
Testing 2 threads: Sum: 2001473960 / 348079929(575.004%) Loop: 250184245 / 348079929(71.8755%)
Testing 3 threads: Sum: 2991335152 / 521830834(573.238%) Loop: 373916894 / 521830834(71.6548%)
Testing 4 threads: Sum: 3892119680 / 693704725(561.063%) Loop: 486514960 / 693704725(70.1329%)
Testing 5 threads: Sum: 4957263080 / 802362140(617.834%) Loop: 619657885 / 802362140(77.2292%)
Testing 6 threads: Sum: 5417700112 / 953587414(568.139%) Loop: 677212514 / 953587414(71.0174%)
Testing 7 threads: Sum: 6078496824 / 1067533241(569.396%) Loop: 759812103 / 1067533241(71.1746%)
Testing 8 threads: Sum: 6679841000 / 1196224828(558.41%) Loop: 834980125 / 1196224828(69.8013%)
Testing 9 threads: Sum: 7396623960 / 1308004474(565.489%) Loop: 924577995 / 1308004474(70.6861%)
Testing 10 threads: Sum: 8158849904 / 1416026963(576.179%) Loop: 1019856238 / 1416026963(72.0224%)
Testing 11 threads: Sum: 8868695984 / 1556964234(569.615%) Loop: 1108586998 / 1556964234(71.2018%)
Testing 12 threads: Sum: 9441092968 / 1655554694(570.268%) Loop: 1180136621 / 1655554694(71.2835%)
Testing 13 threads: Sum: 9530295080 / 1689916907(563.951%) Loop: 1191286885 / 1689916907(70.4938%)
Testing 14 threads: Sum: 10444142536 / 1805583762(578.436%) Loop: 1305517817 / 1805583762(72.3045%)
Testing 15 threads: Sum: 10834255144 / 1926575218(562.358%) Loop: 1354281893 / 1926575218(70.2948%)
With g++ -O3 -march=native -lpthread, I get:
Testing 1 threads: Sum: 2933270968 / 3112671000(94.2365%) Loop: 366658871 / 3112671000(11.7796%)
Testing 2 threads: Sum: 5839842040 / 6177278029(94.5375%) Loop: 729980255 / 6177278029(11.8172%)
Testing 3 threads: Sum: 8775103584 / 9219587924(95.1789%) Loop: 1096887948 / 9219587924(11.8974%)
Testing 4 threads: Sum: 11350253944 / 10210948580(111.158%) Loop: 1418781743 / 10210948580(13.8947%)
Testing 5 threads: Sum: 14487451488 / 14623220822(99.0715%) Loop: 1810931436 / 14623220822(12.3839%)
Testing 6 threads: Sum: 17141556576 / 14437058094(118.733%) Loop: 2142694572 / 14437058094(14.8416%)
Testing 7 threads: Sum: 19883362288 / 18313186637(108.574%) Loop: 2485420286 / 18313186637(13.5718%)
Testing 8 threads: Sum: 22574437968 / 17115166001(131.897%) Loop: 2821804746 / 17115166001(16.4872%)
Testing 9 threads: Sum: 25356792368 / 18332200070(138.318%) Loop: 3169599046 / 18332200070(17.2898%)
Testing 10 threads: Sum: 28079398984 / 20747150935(135.341%) Loop: 3509924873 / 20747150935(16.9176%)
Testing 11 threads: Sum: 30783433560 / 21801526415(141.199%) Loop: 3847929195 / 21801526415(17.6498%)
Testing 12 threads: Sum: 33420443880 / 22794998080(146.613%) Loop: 4177555485 / 22794998080(18.3266%)
Testing 13 threads: Sum: 35989535640 / 23596768252(152.519%) Loop: 4498691955 / 23596768252(19.0649%)
Testing 14 threads: Sum: 38647578408 / 23796083111(162.412%) Loop: 4830947301 / 23796083111(20.3014%)
Testing 15 threads: Sum: 41148330392 / 24252804239(169.664%) Loop: 5143541299 / 24252804239(21.208%)
EDIT: After removing the loop variable and leaving only local in both cases (see the edit in the code), the results are still the same.
EDIT2: The results above were with GCC 9.3 on Ubuntu. I switched to GCC 10.2 on Windows (mingw), and it shows very good scaling, see below (results are with the original code). Can we pretty much conclude this is a problem with MSVC and older versions of GCC?
Testing 1 threads: Sum: 23752640416 / 3153263747(753.272%) Loop: 2969080052 / 3153263747(94.159%)
Testing 2 threads: Sum: 46533874656 / 6012052456(774.01%) Loop: 5816734332 / 6012052456(96.7512%)
Testing 3 threads: Sum: 66076900784 / 9260324764(713.548%) Loop: 8259612598 / 9260324764(89.1936%)
Testing 4 threads: Sum: 92216030528 / 12229625883(754.038%) Loop: 11527003816 / 12229625883(94.2548%)
Testing 5 threads: Sum: 111822357864 / 14439219677(774.435%) Loop: 13977794733 / 14439219677(96.8044%)
Testing 6 threads: Sum: 122858189272 / 17693796489(694.357%) Loop: 15357273659 / 17693796489(86.7947%)
Testing 7 threads: Sum: 148478021656 / 19618236169(756.837%) Loop: 18559752707 / 19618236169(94.6046%)
Testing 8 threads: Sum: 156931719736 / 19770409566(793.771%) Loop: 19616464967 / 19770409566(99.2213%)
Testing 9 threads: Sum: 143331726552 / 20753115024(690.652%) Loop: 17916465819 / 20753115024(86.3315%)
Testing 10 threads: Sum: 143541178880 / 20331801415(705.993%) Loop: 17942647360 / 20331801415(88.2492%)
Testing 11 threads: Sum: 160425817888 / 22209102603(722.343%) Loop: 20053227236 / 22209102603(90.2928%)
Testing 12 threads: Sum: 157095281392 / 23178532051(677.762%) Loop: 19636910174 / 23178532051(84.7202%)
Testing 13 threads: Sum: 156015224880 / 23818567634(655.015%) Loop: 19501903110 / 23818567634(81.8769%)
Testing 14 threads: Sum: 145464754912 / 23950304389(607.361%) Loop: 18183094364 / 23950304389(75.9201%)
Testing 15 threads: Sum: 149279587872 / 23585183977(632.938%) Loop: 18659948484 / 23585183977(79.1172%)
reinterpret_cast<uint32_t*>(&local) after the loop got GCC9 to store/reload local inside the loop, creating a store-forwarding bottleneck.

This is already fixed in GCC10; no need to file a missed-optimization bug. Don't take pointers onto a __m256i local; it also violates strict-aliasing, so it's Undefined Behaviour without -fno-strict-aliasing, even though GCC often makes it work.
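For reference, a strict-aliasing-safe way to reduce the vector looks something like the following. This is a minimal sketch of the usual extract/shuffle horizontal sum, not the exact code from my Godbolt link; it sums in 32 bits, so like any 32-bit hsum it can wrap on overflow.

// Minimal sketch: strict-aliasing-safe horizontal sum of 8 x 32-bit ints in a __m256i,
// as an alternative to reinterpret_cast + std::accumulate on the vector object itself.
#include <immintrin.h>
#include <cstdint>

static inline uint32_t hsum_epi32_avx2(__m256i v)
{
    __m128i lo  = _mm256_castsi256_si128(v);              // low 128 bits (free)
    __m128i hi  = _mm256_extracti128_si256(v, 1);         // high 128 bits
    __m128i sum = _mm_add_epi32(lo, hi);                  // 4 partial sums
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2))); // fold 4 -> 2
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1))); // fold 2 -> 1
    return static_cast<uint32_t>(_mm_cvtsi128_si32(sum)); // element 0 holds the total
}

With something like this, local stays in a YMM register for the whole loop and is only reduced once at the end.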
GCC 9.3 (which you're using) is storing/reloading the vector inside the loop, but keeping the scalar in a register for inc eax!

The vector loop therefore bottlenecks on the latency of vector store-forwarding plus vpaddd, and that just happens to be a bit more than 8x slower than the scalar loop. Their bottlenecks are unrelated; the total speeds coming out close to 1x is merely coincidence.

(The scalar loop presumably runs at 1 cycle per iteration on Zen1 or Skylake, and 7 cycles of store-forwarding latency plus 1 for vpaddd sounds about right.)
It's indirectly caused by reinterpret_cast<uint32_t*>(&local), either because GCC tries to be forgiving of the strict-aliasing undefined-behaviour violation, or just because you're taking a pointer to the local at all.

This isn't normal or expected, but the combination of the atomic load inside the inner loop and the lambda may be what confuses GCC9 into making this mistake. (Note that GCC9 and GCC10 both reload the address of stopFlag from the thread-function arg inside the loop, even for scalar, so there was already some failure to keep things in registers.)

In normal use-cases you'll be doing more SIMD work per check of a stop flag, and often you wouldn't even be keeping vector state across iterations. Usually you'll have a non-atomic arg telling you how much work to do, not a stop flag you check inside the inner loop, so this missed-optimization bug is rarely a problem. (Unless it happens even without an atomic flag?)
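To illustrate that structure, here is a hypothetical sketch (countAdds and chunkIters are my names, not from the question): a fixed-size chunk of register-only SIMD work per check of the atomic flag, reducing the vector once per chunk with the hsum_epi32_avx2 helper sketched above. The point is the shape of the loop, not benchmark fidelity.

// Hypothetical restructuring of the SIMD worker: check the atomic flag once per chunk
// instead of once per add, and don't carry vector state across flag checks.
#include <immintrin.h>
#include <atomic>
#include <cstddef>
#include <cstdint>

uint64_t countAdds(const std::atomic_bool& stop, std::size_t chunkIters = 1u << 20)
{
    uint64_t total = 0;
    const __m256i one = _mm256_set1_epi32(1);
    while (!stop)                                     // atomic load once per chunk
    {
        __m256i acc = _mm256_setzero_si256();
        for (std::size_t k = 0; k < chunkIters; ++k)  // pure register work, no memory traffic
            acc = _mm256_add_epi32(acc, one);
        total += hsum_epi32_avx2(acc);                // reduce once per chunk (helper above)
    }
    return total;
}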
Reproducible on Godbolt, showing -DUB_TYPEPUN vs. -UUB_TYPEPUN for a version of the source where I used #ifdef to select your unsafe (and missed-opt-triggering) version vs. a safe one with manually-vectorized shuffles from [link]. (That manual hsum doesn't widen before adding, so it can overflow and wrap. But that's not the point; with different manual shuffles, or with _mm256_store_si256 to a separate array, you can get the result you want without strict-aliasing undefined behaviour.)
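For completeness, a minimal sketch of that store-to-a-separate-array option (my own illustration, not the Godbolt code). Storing through a __m256i* into a plain aligned uint32_t array is well-defined because GCC's headers declare __m256i as a may_alias type; the problem in the question is the opposite direction, reading a __m256i object through a uint32_t*.

// Minimal sketch: store the vector once, after the loop, into an aligned scalar array,
// then sum with 64-bit widening so nothing wraps.
#include <immintrin.h>
#include <cstdint>
#include <numeric>

static inline uint64_t hsum_via_store(__m256i v)
{
    alignas(32) uint32_t tmp[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(tmp), v); // vector store; __m256i* may alias
    return std::accumulate(tmp, tmp + 8, uint64_t{});       // widen while summing
}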
The scalar loop is:
# g++9.3 -O3 -march=znver1
.L5: # do{
inc eax # local++
.L3:
mov rdx, QWORD PTR [rdi+8] # load the address of stopFlag from the lambda
movzx edx, BYTE PTR [rdx] # zero-extend *&stopFlag into EDX
test dl, dl
je .L5 # }while(stopFlag == 0)
The vector loop, g++ 9.3 -O3 -march=znver1, using your reinterpret_cast (i.e. -DUB_TYPEPUN in my version of the source):
# g++9.3 -O3 -march=znver1 with your pointer-cast onto the vector
# ... ymm1 = _mm256_set1_epi32(1)
.L10: # do {
vpaddd ymm1, ymm0, YMMWORD PTR [rsp-32] # memory-source add with set1(1)
vmovdqa YMMWORD PTR [rsp-32], ymm1 # store back into stack memory
.L8:
mov rax, QWORD PTR [rdi+8] # load flag address
movzx eax, BYTE PTR [rax] # load stopFlag
test al, al
je .L10 # }while(stopFlag == 0)
... auto-vectorized hsum, zero-extending elements to 64-bit for vpaddq
But with a safe __m256i horizontal sum that avoids a pointer onto local entirely, local stays in a register:
# ymm1 = _mm256_set1_epi32(1)
.L9:
vpaddd ymm0, ymm1, ymm0 # local += set1(1), staying in a register, ymm0
.L8:
mov rax, QWORD PTR [rdi+8] # same loop overhead, still 3 uops (with fusion of test/je)
movzx eax, BYTE PTR [rax]
test al, al
je .L9
... manually-vectorized 32-bit hsum
On my Intel Skylake i7-6700k, I get the expected 800% +- 1% for every thread count, with g++ 10.1 -O3 -march=skylake, Arch GNU/Linux, energy_performance_preference=balance_power (max clocks = 3.9GHz with any number of cores active).

The scalar and vector loops have the same number of uops and no different bottlenecks, so they run at identical cycles per iteration. (4 uops; possibly running at 1 iteration per cycle if it can keep those chained address->value loads of the stop flag in flight.)

Zen1 might differ because vpaddd ymm is 2 uops. But its front-end is wide enough that it probably still runs the loop at 1 cycle per iteration, so you might see 800% there, too.

With ++loop uncommented, I get ~267% "SIMD speed". With an extra inc in the SIMD loop it becomes 5 uops, and probably suffers from some nasty front-end effect on Skylake.

-O0 benchmarking is meaningless in general; it has different bottlenecks (usually store/reload from keeping everything in memory), and SIMD intrinsics usually have a lot of extra overhead at -O0. Although in this case even -O3 bottlenecked on store/reload for the SIMD loop.