评估 sizeof 会导致性能提升 C

Question

我有一个动态数组，有一个热循环花费大量时间向我实现的动态数组添加大量元素。

动态数组通过在值数组的开头存储一个 header 来工作：

typedef struct
{
    unsigned char* begin; // Pointer to first element (should just point to end of struct, the header is stored prior data)
    unsigned char* end; // Pointer to last element
    size_t typeSize; // Size of type that the array stores
    size_t capacity; // capacity of array (when size ((end - begin) / typeSize) exceeds capacity, realloc)
} dynamicArr;

出于某种原因，当我将 sizeof item 与 typeSize 进行比较时，循环运行得更快，而当我不进行比较时，循环运行速度大幅提升（增加 30%） .

请注意，比较只是为了安全地防止添加不同类型的项目并导致数组中的错位。如果正确使用动态数组来存储 1 种类型，则无论如何都不应该发生这种情况，因此在实践中应该始终评估为真。

下面是向列表添加元素的代码：

if (arr)
{
    dynamicArr* header = dynamicArr_Header(arr);
    if (header->typeSize && header->typeSize == sizeof item) //If I remove "header->typeSize == sizeof item" performance decreases.
    {
        size_t arrSize = dynamicArr_Size(header);
        if (arrSize == header->capacity)
        {
            size_t newCapacity = (size_t)(header->capacity * 1.5f);
            if (newCapacity == header->capacity) ++newCapacity;
            void* tmp = realloc(header, sizeof(dynamicArr) + header->typeSize * newCapacity);
            if (tmp)
            {
                dynamicArr_Construct(header, tmp, newCapacity, arrSize, header->typeSize);
                *((void**)&(arr)) = header->begin;
                arr[arrSize] = item;
                header->end += header->typeSize;
            }
            else 
            { 
                free(header); 
                arr = NULL; 
            }
        }
        else
        {
            arr[arrSize] = item;
            header->end += header->typeSize;
        }
    }
}

我不明白为什么会这样。我也不太擅长阅读汇编，据我所知，尽管它们有很大的不同，所以如果有人能在这里帮助我，我将不胜感激！

（使用 /O2 和 /Tc 在 MSVC 中编译）

Link 到程序集和其余相关代码。

编辑 1： 很多人似乎认为原因是因为 sizeof item 只是在编译时求值。我不认为是这种情况，因为如果我删除条件并将 header->typeSize 的所有实例替换为 sizeof item，性能仍然比 if 条件存在时更差。 => 我似乎错过了在宏 dynamicArr_Size 中更改 header->typeSize 的使用，这导致了这种混乱，请参阅标记的答案。

完整代码如下：

typedef struct
{
    unsigned char* begin; // Pointer to data
    unsigned char* end; // Pointer to last element
    size_t typeSize; // Size of type
    size_t capacity; // Capacity of array (not number of elements in array)
} dynamicArr;

#define dynamicArr_ConstructBase(dest, src, newCapacity) dest = src; dest->capacity = newCapacity; dest->begin = (unsigned char*)dest + sizeof *dest
#define dynamicArr_Construct(dest, src, newCapacity, currSize, typeSize) dynamicArr_ConstructBase(dest, src, newCapacity); dest->end = dest->begin + typeSize * (currSize)

#define dynamicArr_Header(arr) ((dynamicArr*)((unsigned char*)(arr) - sizeof(dynamicArr)))
static inline size_t dynamicArr_Size(dynamicArr* arr)
{
    return (arr->end - arr->begin) / arr->typeSize;
}

#define dynamicArr_Create(typename, arr) typename* arr = (typename*)dynamicArr_Create_(sizeof(typename))
static inline unsigned char* dynamicArr_Create_(size_t typeSize)
{
    dynamicArr* dynArr;
    void* tmp = malloc(sizeof * dynArr + typeSize * 10);
    if (!tmp) return NULL;

    dynArr = tmp;
    dynArr->begin = (unsigned char*)dynArr + sizeof * dynArr;
    dynArr->end = dynArr->begin;
    dynArr->capacity = 10;
    dynArr->typeSize = typeSize;

    return dynArr->begin;
}

#define dynamicArr_Free(arr) free(dynamicArr_Header(arr))

#define dynamicArr_Push(arr, item) \
do {\
if (arr) \
{ \
    dynamicArr* header = dynamicArr_Header(arr); \
    if (header->typeSize && header->typeSize == sizeof item) \
    { \
        size_t arrSize = dynamicArr_Size(header); \
        if (arrSize == header->capacity) \
        { \
            size_t newCapacity = (size_t)(header->capacity * 1.5f); \
            if (newCapacity == header->capacity) ++newCapacity; \
            void* tmp = realloc(header, sizeof(dynamicArr) + header->typeSize * newCapacity); \
            if (tmp) \
            { \
                dynamicArr_Construct(header, tmp, newCapacity, arrSize, header->typeSize); \
                *((void**)&(arr)) = header->begin; \
                arr[arrSize] = item; \
                header->end += header->typeSize; \
            } \
            else  \
            {  \
                free(header);  \
                arr = NULL;  \
            } \
        } \
        else \
        { \
            arr[arrSize] = item; \
            header->end += header->typeSize; \
        } \
    } \
} \
} while(0)

示例使用：

void Func()
{
    dynamicArr_Create(int, intArr);
    dynamicArr_Push(intArr, 10);
    printf("%i\n", intArr[0]);
    dynamicArr_Free(intArr);
}

至于一个简单的分析测试：

int main()
{
    dynamicArr_Create(int, intArr);

    clock_t begin = clock();

    for (int i = 0; i < 1000000000; ++i)
    {
        dynamicArr_Push(intArr, 10);
    }

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);

    dynamicArr_Free(intArr);
}

在 Visual Studio 2019 年 windows 中使用 /Tc 在发布模式下编译我得到的结果：

header->typeSize == sizeof item => 3.65 秒
没有 header->typeSize == sizeof item => 9.374 秒
用 sizeof item 替换 header->typeSize 并删除 header->typeSize == sizeof item => 9.302 秒

我重复测试了10次，和上面的结果一致

Answer 1

sizeof 是一个几乎所有时间都在编译时计算的运算符（VLA 是一个例外）

Answer 2

这是非常简单的常量传播，它使编译器能够使用更高效的指令。

由于 sizeof 的值是编译时已知的常量，在具有比较的版本中，编译器知道 typeSize 的值，因此可以使用更有效的指令。这里以dynamicArr_Size(header)的计算为例，即(header->end - header->begin) / header->typeSize:

--- without sizeof
+++ with sizeof
-        mov     rax, QWORD PTR [rbx+8]
-        xor     edx, edx
-        sub     rax, QWORD PTR [rbx]
-        div     r8
+        mov     rdi, QWORD PTR [rbx+8]
+        sub     rdi, QWORD PTR [rbx]
+        shr     rdi, 2

在没有 sizeof 的版本中，编译器必须将元素大小视为未知数并使用实际的除法指令（这还需要将 rdx 寄存器清零，因为该指令需要一个rdx:rax 寄存器对中的 128 位被除数）。当大小已知时，编译器可以替换为更快的位移位并避免触及 rdx。这可能是最有影响力的差异，因为除法指令相对于其他算术而言往往非常慢；大多数编译器，只要有机会（当除数是常量时），就会使用位移位或 wrap-around multiplication tricks just to avoid using the division instruction. (Matt Godbolt has a whole section about division in his talk about compiler optimisations。）更有效地使用寄存器还可以释放寄存器以用于其他地方，并且可以防止将值溢出到内存（尽管在这种情况下似乎没有太大区别）。

另一个例子，下面是 sizeof(dynamicArr) + header->typeSize * newCapacity 的计算方式：

--- without sizeof
+++ with sizeof
-        mov     rdx, rsi
-        imul    rdx, r8
-        add     rdx, 32
+        lea     rdx, QWORD PTR [rsi*4+32]

在没有与 sizeof 进行比较的版本中，编译器必须将 header->typeSize 视为未知数并使用通用乘法指令。当大小已知时，它可以改为使用具有特殊寻址模式的 lea 指令，允许在单个指令中计算值。尽管通用乘法不如除法慢，但移位仍然可以胜过它。但即使 lea 本身不一定更快，更高的代码密度允许更多代码放入指令缓存并避免由于缓存未命中而导致速度减慢。

最后，你声称

A lot of people seem to think that the reason is because sizeof item is simply evaluated at compile time. I don't think that this is the case because if I remove the condition and replace all instances of header->typeSize with sizeof item the performance is still worse than if the if condition was there.

我假设您只替换了宏中字段的直接使用，而不是 dynamicArr_Size 实用程序函数中的间接使用。当您也替换该用法时，the generated code is nearly identical, modulo label names and the immediate value in the if condition check。通过与 sizeof 的比较，编译器在内联该函数时会自动执行此操作。

Answer 3

这里要注意的主要事情是 dynamicArr_Push 是宏而不是函数。

if (header->typeSize && header->typeSize == sizeof item) \

这行提示编译器只会插入 typeSize 个项目，并且它可以预先计算您在 for 循环中需要的最大大小 (1000000000*sizeof(item))并通过在第一个 go 本身分配一个大块来消除重复的内存分配。

for (int i = 0; i < 1000000000; ++i)
{
    dynamicArr_Push(intArr, 10);
}

要对此进行测试，请在 dynamicArr_Create_ 和运行中将初始 capacity 大小更改为 1000000000 (dynArr->capacity = 10;) 并再次删除检查header->typeSize == sizeof item。您应该观察到与 size-check 版本类似的性能改进。或者尝试使用 strace 之类的工具来计算使用和不使用 size-check 时对 realloc 的调用次数。 size-check 版本肯定已经完全淘汰了。

评估 sizeof 会导致性能提升 C

Evaluating sizeof causes improvement in performance C

c

optimization

x86-64

compiler-optimization