在 C 中初始化所有元素只有两个字节的数组的最快方法是什么？

Question

假设我们有一个名为的数组：

uint8_t data_8_bit[2] = {color >> 8, color & 0xff};

数据为16位颜色数据。我们的目标是创建一个名为：

的数组

uint8_t data_16_bit[2*n];

其中n实际上是16位数据数组的长度。但是数组 data_16_bit 不能保存 16 位值，因此我添加了一个 2*n 作为数组大小。

当然，我知道我可以使用 for 循环填充数组 data_16_bit：

for(int i = 0; i < n; i++)
    for(int j = 0; j < 2; j++)
        data_16_bit[j*i] = data_8_bit[j];

但一定有比这更快的方法吧？ memset 或 memcpy?

Answer 1

IMO 最容易被编译器优化（也非常安全）的是

void foo(uint16_t color, uint8_t *arr16, size_t n)
{
    uint8_t data_8_bit[2] = {color >> 8, color & 0xff};

    while(n--)
    {
        memcpy(arr16 + n * 2, data_8_bit, 2);
    }
}

https://godbolt.org/z/8Wh5Pc3aP

Answer 2

看来您要做的是确保 data_16_bit 中偶数索引处的每个元素都包含与 data_8_bit[0] 相同的值，奇数索引处的每个元素包含相同的值值为 data_8_bit[1].

标准 C 不提供通过初始化程序表达此类内容的方法。
memset() 本身并不能提供比普通迭代更好的解决方案，因为您试图将目标字节设置为交替值而不是全部设置为相同值。
memcpy() 不会产生任何比简单迭代赋值更好（如果有的话）的简单方法，因为源模式只有两个字节。在一般情况下，对 memcpy() 执行少于 n 次调用是可能的，但实现这一点的代码将相当复杂。

如果 n 是一个编译时常量，那么最快的方法就是写出一个完整的初始化程序：

uint8_t data_16_bit[2*8] = {
  color >> 8, color & 0xff,
  color >> 8, color & 0xff,
  color >> 8, color & 0xff,
  color >> 8, color & 0xff,
  color >> 8, color & 0xff,
  color >> 8, color & 0xff,
  color >> 8, color & 0xff,
  color >> 8, color & 0xff
};

如果n不是编译时常量则

您应该考虑使用动态分配的内存而不是 VLA，并且
您不能使用初始化器。

在那种情况下，像您的 for 循环这样的东西可能已经很好了。不过，我会这样写：

for(int i = 0; i < n * 2; i += 2) {
    data_16_bit[i] =   data_8_bit[0];
    data_16_bit[i+1] = data_8_bit[1];
}

Answer 3

我们可以使用 uint16_t 指针存储到 uint8_t 数组中。

这是可取的，因为原始填充值是 16 位。

我们甚至可以一次存储 64 位。我们将 16 位颜色复制四次以获得 uint64_t 值。我们使用 uint64_t 指针存储到数组中。

此方法是内置 memset 函数尝试执行的操作 [它会尝试使用 XMM 寄存器]。

这是我的想法。

请注意，如果我们需要对齐 8 字节边界 [如果 arch 需要这样做]，我们可以从字节存储开始，然后在中间进行宽存储并在末尾恢复为 char/short 存储.

#include <stddef.h>
#include <stdint.h>

void
fill8(uint8_t *data,uint16_t color,size_t len)
{
    uint64_t *p64;
    uint16_t *p16;
    size_t count;

    // NOTE: enhancement would be to align the data pointer to 8 byte boundary
    // by transferring 16 bit data as below

    // get 64 bit color value
    uint64_t c64 = 0;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;

    // get pointer to 64 bit data and its count
    p64 = (uint64_t *) data;
    count = len / sizeof(c64);

    // increment byte pointer and decrement byte length
    data += count * sizeof(c64);
    len -= count * sizeof(c64);

    // transfer data 64 bits at a time
    for (;  count > 0;  --count, ++p64)
        *p64 = c64;

    // get pointer to 16 bit data
    p16 = (uint16_t *) data;
    count = len / sizeof(color);

    // increment byte pointer and decrement byte length
    data += count * sizeof(color);
    len -= count * sizeof(color);

    // transfer data 16 bits at a time
    for (;  count > 0;  --count, ++p16)
        *p16 = color;
}

更新：

It leads to UB. Strict alising. If platform does not allow unaligned access and *data is unaligned it will result in the exception – 0___________

未对齐访问不是“严格别名”违规。它是 未对齐 访问。在某些体系结构上，这将导致对齐异常 [在硬件中]。我在上面的评论中解决了这个问题，但是，公平地说，下面有一个重新设计的示例，其中包含对齐代码。

Note that we can start with byte stores if we need to align to an 8 byte boundary Alignment has nothing to do with strict aliasing. It doesn't matter where the data that uint_t *data points to - in this function it's of type uint8_t and referring to it with a uint64_t * pointer unequivocally violates strict aliasing. – Andrew Henle

不，我不认为它违反了“严格别名”，因为它不适用于此处。 “违反类型双关语”可能会让你更幸运，但我对此也表示怀疑。

[下面] [更新] 代码与 Freebsd 的 memset 非常相似。

而且，即使代码违反了“规则”，也有已知的解决方法和例外情况：

链接中有更多详细信息，但严格的别名允许编译器优化函数以访问 a 和 b 即：

void func(int *a,long *b)

但是，它真的想要：

void func(int * restrict a,long * restrict b)

如果没有 restrict，编译器将无法对指针进行 [可疑] 优化，因为它无法确定它们是否重叠。

在这里，编译器可以推断指针关系并生成正确的代码，因为指针的方式是set/incremented。

如果我们必须，为了 [完全] 安全，使用 -fno-strict-aliasing 编译 [但我认为在这种情况下不需要它]。

可能是“双关语”。但是，因为数据是 unsigned，并且关系是已知的（例如 uint8_t 和 uint64_t），所以生成的代码仍然是正确的。

正如我提到的，memset 执行类似的 pointer/data 操作。如果这里的代码是“坏的”，那么 libc 也被破坏了。请参阅下面的 Freebsd memset 实现。

我在评论中提到的指针对齐问题有效。这是一些重新编写的代码来解决对齐问题：

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <byteswap.h>

#define OFFOF(_ptr) \
    ((uintptr_t) (_ptr))

int opt_b;
int opt_v;

#define sysfault(_fmt...) \
    do { \
        printf(_fmt); \
        exit(1); \
    } while (0);

uint8_t phys[1000000];

void
fill8(uint8_t *data,uint16_t color,size_t len)
{
    uint64_t *p64;
    uint16_t *p16;
    size_t count;

    if (opt_b)
        color = bswap_16(color);

    // NOTE: enhancement would be to align the data pointer to 8 byte boundary
    // by transferring 16 bit data as below
    for (;  (len > 0) && (OFFOF(data) & 0x07);  ++data, --len) {
        *data = color;
        color = bswap_16(color);
    }

    // get 64 bit color value
    uint64_t c64 = 0;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;
    c64 = (c64 << 16) | color;

    // get pointer to 64 bit data and its count
    p64 = (uint64_t *) data;
    count = len / sizeof(c64);

    // increment byte pointer and decrement byte length
    data += count * sizeof(c64);
    len -= count * sizeof(c64);

    // transfer data 64 bits at a time
    for (;  count > 0;  --count, ++p64)
        *p64 = c64;

    // transfer data 8 bits at a time
    for (;  len > 0;  ++data, --len) {
        *data = color;
        color = bswap_16(color);
    }
}

void
verify(size_t len,size_t align)
{
    uint8_t *data;
    size_t idx;
    uint8_t val;

    data = &phys[0];
    for (idx = 0;  idx < align;  ++idx) {
        val = data[idx];
        if (val != 0)
            sysfault("verify: BEF idx=%zu val=%2.2X\n",idx,val);
    }

    data = &phys[align];
    for (idx = 0;  idx < len;  ++idx) {
        val = data[idx];

        if (opt_v) {
            printf(" %2.2X",val);
            if ((idx % 16) == 15)
                printf("\n");
        }

        if (val == 0)
            sysfault("verify: DAT idx=%zu val=%2.2X\n",idx,val);
    }

    if (opt_v)
        printf("\n");

    data = &phys[align + len];
    for (idx = 0;  idx < align;  ++idx) {
        val = phys[idx];
        if (val != 0)
            sysfault("verify: AFT idx=%zu val=%2.2X\n",idx,val);
    }
}

void
dotest(int tstno,size_t len,size_t align)
{
    uint8_t *data = phys;

    memset(phys,0,sizeof(phys));

    while (1) {
        uintptr_t off = data - phys;
        if (off == align)
            break;
        ++data;
    }

    if ((tstno > 1) && opt_v)
        printf("\n");
    printf("T:%d %p L:%zu A:%zu\n",tstno,data,len,align);

    uint16_t color = 0x0102;
    fill8(data,color,len);

    verify(len,align);
}

int
main(int argc,char **argv)
{
    int tstno = 1;

    --argc;
    ++argv;

    for (;  argc > 0;  --argc, ++argv) {
        char *cp = *argv;
        if (*cp != '-')
            break;

        cp += 2;
        switch(cp[-1]) {
        case 'b':  // big endian
            opt_b = ! opt_b;
            break;
        case 'v':  // verbose check
            opt_v = ! opt_v;
            break;
        }
    }

    for (size_t len = 1;  len <= 128;  ++len) {
        for (size_t align = 0;  align < 8;  ++align, ++tstno)
            dotest(tstno,len,align);
    }

    return 0;
}

这是 [其中一个] Freebsd 的 memset 实现：

/*-
 * Copyright (c) 1990, 1993
 *  The Regents of the University of California.  All rights reserved.
 *
 * This code is derived from software contributed to Berkeley by
 * Mike Hibler and Chris Torek.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

#if defined(LIBC_SCCS) && !defined(lint)
static char sccsid[] = "@(#)memset.c    8.1 (Berkeley) 6/4/93";
#endif /* LIBC_SCCS and not lint */
#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");

#include <sys/types.h>

#include <limits.h>

#define wsize   sizeof(u_int)
#define wmask   (wsize - 1)

#ifdef BZERO
#include <strings.h>

#define RETURN  return
#define VAL 0
#define WIDEVAL 0

void
bzero(void *dst0, size_t length)
#else
#include <string.h>

#define RETURN  return (dst0)
#define VAL c0
#define WIDEVAL c

void *
memset(void *dst0, int c0, size_t length)
#endif
{
    size_t t;
#ifndef BZERO
    u_int c;
#endif
    u_char *dst;

    dst = dst0;
    /*
     * If not enough words, just fill bytes.  A length >= 2 words
     * guarantees that at least one of them is `complete' after
     * any necessary alignment.  For instance:
     *
     *  |-----------|-----------|-----------|
     *  |00|01|02|03|04|05|06|07|08|09|0A|00|
     *            ^---------------------^
     *       dst         dst+length-1
     *
     * but we use a minimum of 3 here since the overhead of the code
     * to do word writes is substantial.
     */
    if (length < 3 * wsize) {
        while (length != 0) {
            *dst++ = VAL;
            --length;
        }
        RETURN;
    }

#ifndef BZERO
    if ((c = (u_char)c0) != 0) {    /* Fill the word. */
        c = (c << 8) | c;   /* u_int is 16 bits. */
#if UINT_MAX > 0xffff
        c = (c << 16) | c;  /* u_int is 32 bits. */
#endif
#if UINT_MAX > 0xffffffff
        c = (c << 32) | c;  /* u_int is 64 bits. */
#endif
    }
#endif
    /* Align destination by filling in bytes. */
    if ((t = (long)dst & wmask) != 0) {
        t = wsize - t;
        length -= t;
        do {
            *dst++ = VAL;
        } while (--t != 0);
    }

    /* Fill words.  Length was >= 2*words so we know t >= 1 here. */
    t = length / wsize;
    do {
        *(u_int *)dst = WIDEVAL;
        dst += wsize;
    } while (--t != 0);

    /* Mop up trailing bytes, if any. */
    t = length & wmask;
    if (t != 0)
        do {
            *dst++ = VAL;
        } while (--t != 0);
    RETURN;
}

Answer 4

虽然很多人都不知道，但如果 sizeof(wchar_t) 在您的平台上是 2 的倍数，例如当它是 2 字节类型时，您可以使用 wmemset：

_Static_assert(sizeof(wchar_t)*CHAR_BIT == 16);

wchar_t pattern;
memcpy(&pattern, data_8_bit, 2);
wmemset((wchar_t*)data_16_bit, pattern, n);

如果 wchar_t 是大多数 *nix 平台上的 4 字节类型

_Static_assert(sizeof(wchar_t)*CHAR_BIT == 32);

wchar_t pattern;
memcpy(&pattern, data_8_bit, 2);
memcpy((char*)&pattern + 2, data_8_bit, 2);
wmemset((wchar_t*)data_16_bit, pattern, n);

如果 wchar_t 更大（极不可能），则只需重复创建填充图案的第一步

wmemset 应该像 memset 一样在汇编中使用 SIMD 进行手动优化，因此与没有编译器的其他解决方案相比，它会非常快无法自动矢量化。例如有很多 optimized memset and wmemset versions for x86-64 in glibc including SSE2, AVX2 and even AVX-512

Answer 5

需要考虑的几个问题。

初始化后，此数据是只读的还是可修改的？
元素个数是固定的吗？可在构建时配置？或者在运行时变化？
元素的最大数量是否有合理的限制？
您需要多少次初始化缓冲区？

从第三个问题开始，由于您处理的是 16 位数据，因此看起来您将拥有最多 65536 个唯一集。如果您愿意为速度牺牲一点 space，您可以创建一个 64 kB 的全局 table，其中包含按您期望的顺序排列的所有排列。最终结果是您将 table 自动加载到内存中，因为它将驻留在您的 executable 中的数据部分之一。你如何 create/populate 这个 table 取决于你。（例如，您可以手动创建 table，或者您可以在构建过程中有一个专门的步骤来创建它并将其编译成一个可链接的目标文件。）

继续，假设您的 table 已加载到内存中。

如果您预先填充的 table 至少与您需要的一样大，您可以...

Return 如果内容永远不会改变，则指向 table 的指针（对于这种情况，最好将其驻留在只读数据部分）。好处是不需要复制更多的数据——你只需要移动一个指针。
使用内存复制例程，例如 memcpy()（如果您不想要编译器生成的内容，则使用自定义例程）从预填充的 table 复制到您想要的缓冲区，如果内容将要更改，或者如果您的目标缓冲区大于预填充的 64 kB table.

在 C 中初始化所有元素只有两个字节的数组的最快方法是什么？

What is the fastest way to initialize an array in C with only two bytes for all elements?

c

arrays

memset

memcpy

fill