为什么 MSVC 在 std::bitset::count 的实现中不使用 __popcnt？

Question

我很想知道 MSVC 是否为 bitset::count.

使用了编译器内部 __popcnt

环顾四周，我发现这是 std::bitset::count for VS2017 的实现：

size_t count() const _NOEXCEPT
        {   // count number of set bits
        const char *const _Bitsperbyte =
            "[=12=]"
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            ""
            "\x8";
        const unsigned char *_Ptr = &reinterpret_cast<const unsigned char&>(_Array);
        const unsigned char *const _End = _Ptr + sizeof (_Array);
        size_t _Val = 0;
        for ( ; _Ptr != _End; ++_Ptr)
            _Val += _Bitsperbyte[*_Ptr];
        return (_Val);
        }

看起来它使用查找表来获取任何给定字节的位数，然后计算每个字节的 1 的数量。

According to this answer here，GCC 是这样实现的（按照我的思路）：

/// Returns the number of bits which are set.
size_t
count() const { return this->_M_do_count(); }      

size_t
_M_do_count() const
{
  size_t __result = 0;
  for (size_t __i = 0; __i < _Nw; __i++)
  __result += __builtin_popcountl(_M_w[__i]);
  return __result;
}

虽然我没有进行任何基准测试，但我敢打赌 GCC 的实施在这里会快很多。

因此，MSVC 如此实现 std::bitset::count 是否有任何令人信服的理由？我的猜测是 MSVC 有一个包罗万象的 "no compiler intrinsics in STL" 政策，或者我忽略的两个平台之间存在差异。

Answer 1

__builtin_popcountl 在 GCC 中的内部实现并不更好，根据体系结构，它类似于下面的内容。

i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;

并且只针对SSE4a指令集，只支持2006年以后的AMD CPU，__builtin_popcountl由一条汇编指令组成POPCNT。

MSDN 说

Each of these intrinsics generates the popcnt instruction. The size of the value that the popcnt instruction returns is the same as the size of its argument. In 32-bit mode there are no 64-bit general-purpose registers, hence no 64-bit popcnt.

To determine hardware support for the popcnt instruction, call the __cpuid intrinsic with InfoType=0x00000001 and check bit 23 of CPUInfo[2] (ECX). This bit is 1 if the instruction is supported, and 0 otherwise. If you run code that uses this intrinsic on hardware that does not support the popcnt instruction, the results are unpredictable.

我认为 MSVC 团队不想使用带有条件的内在条件，而希望使用一种独立于 CPU 和体系结构的通用解决方案。

Answer 2

没有“STL 中没有编译器内在函数”政策，但要求 STL 应运行在所有受支持的 CPU 上，并且最小 CPU 是根据最低 OS 版本的最低要求选择的。因此，仅当提供旧 CPUs 的回退时，才有可能使用为以后的 CPUs 注入指令的内在函数。

目前 VS 2019 中的 C++20 std::popcount 使用运行time CPU 检测，它回落到位计数。

std::bitset::count 也可以开始使用相同的方法。在 STL GitHub 中有一个 issue 等待维护者或贡献者实现。

为什么 MSVC 在 std::bitset::count 的实现中不使用 __popcnt？

Why does MSVC not use __popcnt in its implementation for std::bitset::count?

c++

gcc

stl

visual-c++

std-bitset