fastest way to blit image buffer into an xy offset of another buffer in C++ on amd64 architecture

I have image buffers of arbitrary size that I copy into an equally sized or larger buffer at an x,y offset. The colorspace is BGRA. My current copy method is:

void render(guint8* src, guint8* dest, uint src_width, uint src_height, uint dest_x, uint dest_y, uint dest_buffer_width) {
    bool use_single_memcpy = (dest_x == 0) && (dest_y == 0) && (dest_buffer_width == src_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_width * src_height * 4);
    }
    else {
        dest += (dest_y * dest_buffer_width * 4);
        for(uint i=0;i < src_height;i++) {
            memcpy(dest + (dest_x * 4), src, src_width * 4);
            dest += dest_buffer_width * 4;
            src += src_width * 4;
        }
    }
}

It runs fast, but I'm curious whether there's anything I can do to improve it and shave off a few extra milliseconds. I'd rather avoid anything involving assembly code, but I'm open to adding extra libraries.

Your use_single_memcpy test is too restrictive. A slight rearrangement lets you remove the dest_y == 0 requirement.

void render(guint8* src, guint8* dest,
            uint src_width, uint src_height, 
            uint dest_x, uint dest_y,
            uint dest_buffer_width)
{
    bool use_single_memcpy = (dest_x == 0) && (dest_buffer_width == src_width);
    dest_buffer_width <<= 2;
    src_width <<= 2;
    dest += (dest_y * dest_buffer_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_width * src_height);
    }
    else {
        dest += (dest_x << 2);
        while (src_height--) {
            memcpy(dest, src, src_width);
            dest += dest_buffer_width;
            src += src_width;
        }
    }
}

I also changed the loop to count down (which may be slightly more efficient), removed a useless temporary variable, and hoisted out the repeated calculations.
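As a quick sanity check of the rearranged routine, here is a self-contained copy of it (with uint8_t standing in for GLib's guint8) that a small test can exercise:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Same logic as the rearranged render() above; uint8_t replaces guint8.
void render(const uint8_t* src, uint8_t* dest,
            unsigned src_width, unsigned src_height,
            unsigned dest_x, unsigned dest_y,
            unsigned dest_buffer_width)
{
    bool use_single_memcpy = (dest_x == 0) && (dest_buffer_width == src_width);
    dest_buffer_width <<= 2;   // convert widths from pixels to bytes (4 bytes per BGRA pixel)
    src_width <<= 2;
    dest += dest_y * dest_buffer_width;

    if (use_single_memcpy) {
        memcpy(dest, src, static_cast<size_t>(src_width) * src_height);
    } else {
        dest += dest_x << 2;
        while (src_height--) {
            memcpy(dest, src, src_width);  // one row at a time
            dest += dest_buffer_width;
            src += src_width;
        }
    }
}
```

For example, blitting a 2x2 source into a 4x4 destination at offset (1,1) should leave the destination's top-left pixel untouched and place source bytes starting at pixel (1,1).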

You might be able to do better with SSE intrinsics, copying 16 bytes at a time instead of 4, but then you'd have to worry about alignment and multiples of 4 pixels. A good memcpy implementation should already be doing these things.
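To illustrate, here is a minimal sketch of a 16-bytes-at-a-time row copy using SSE2 intrinsics (SSE2 is baseline on amd64). copy_row_sse2 is a hypothetical helper name; the unaligned load/store variants sidestep the alignment concern at some cost, and a memcpy fallback handles rows that aren't a multiple of 4 pixels:

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics, always available on amd64

// Copy one row of 'bytes' bytes, 16 at a time (4 BGRA pixels per step).
// Unaligned load/store variants are used so the buffers need no special
// alignment; the sub-16-byte tail falls back to memcpy.
static void copy_row_sse2(uint8_t* dest, const uint8_t* src, size_t bytes)
{
    size_t i = 0;
    for (; i + 16 <= bytes; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i), v);
    }
    if (i < bytes)
        memcpy(dest + i, src + i, bytes - i);  // remaining 1-15 bytes
}
```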

A popular answer on Stack Overflow that does use x86-64 assembly and SSE can be found here: Very fast memcpy for image processing?. If you do use that code, you'll need to make sure your buffers are 128-bit aligned. A basic explanation of the code:

  • Non-temporal stores are used, so unnecessary cache writes can be bypassed and writes to main memory can be combined.
  • Reads and writes are interleaved only in very large blocks (many reads are performed, then many writes). Performing many reads back-to-back generally performs better than a single read-write-read-write pattern.
  • Larger registers (the 128-bit SSE registers) are used.
  • Prefetch instructions are included as hints to the CPU pipeline.

I found this document - Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540 - which seems to be the inspiration for the code above, though it targets an older processor generation; however, it contains a substantial discussion of how the technique works.

For example, consider this discussion of write combining / non-temporal stores:

The Pentium II and III CPU caches operate on 32-byte cache-line sized blocks. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-memory performance, under some conditions it can lead to unnecessary data fetches.

In particular, consider a case where the CPU will do an 8-byte MMX register store: movq. Since this is only one quarter of a cache line, it will be treated as a read-modify-write operation from the cache's perspective; the target cache line will be fetched into cache, then the 8-byte write will occur. In the case of a memory copy, this fetched data is unnecessary; subsequent stores will overwrite the remainder of the cache line.

The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line then doing a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining. Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when the MMX non-temporal store instruction is used. Memory is generally marked write combining only when it is used in frame buffers; memory allocated by VirtualAlloc is either uncached or cached (but not write combining). The MMX movntps and movntq non-temporal store instructions instruct the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, it also enables write combining if the target memory is cached.

If you'd rather stick with memcpy, consider investigating the source of the memcpy implementation you're using. Some memcpy implementations look for natively word-aligned buffers and improve performance by using the full register width; others automatically copy as much as possible with native word alignment and then mop up the remainder. Making sure your buffers are 8-byte aligned will help these mechanisms.
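Guaranteeing that alignment is straightforward when you control the allocation. A sketch using C11/C++17 aligned_alloc (alloc_bgra is a hypothetical helper, and this assumes your image library doesn't dictate the allocator); note that aligned_alloc requires the size to be a multiple of the alignment:

```cpp
#include <cstdint>
#include <cstdlib>

// Allocate a BGRA buffer whose base address is 16-byte aligned, so a
// word- or SSE-oriented memcpy can use full-width accesses from byte 0.
// aligned_alloc requires size to be a multiple of the alignment, so the
// byte count is rounded up to the next multiple of 16.
static uint8_t* alloc_bgra(unsigned width, unsigned height)
{
    size_t bytes = static_cast<size_t>(width) * height * 4;
    size_t rounded = (bytes + 15) & ~static_cast<size_t>(15);
    return static_cast<uint8_t*>(aligned_alloc(16, rounded));
}
```

The buffer is released with the ordinary free().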

Some memcpy implementations contain a large number of up-front conditionals to make them efficient for small buffers (< 512 bytes) - you may want to consider copying the code and stripping out those blocks, since you presumably aren't working with small buffers.