Inline assembly + pointer management

Question

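I am very new to using inline assembly in C++ code. What I want to do is basically a kind of memcpy for buffers whose size is a multiple of 32 bytes.

In C++, the code would look something like this: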

void my_memcpy(const std::uint8_t* in, std::uint8_t* out, const std::size_t& sz)
{
    assert((sz % 32 == 0));

    // Copy 32 bytes per iteration with a non-temporal load and store.
    for (const std::uint8_t* it = in; it != (in + sz); it += 32, out += 32)
    {
        __m256i tmp = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(it));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(out), tmp);
    }
}

I have already written some inline assembly, but each time I knew the sizes of the input and output arrays in advance.

So I tried this:

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

     assert((sz%32 == 0));

    __asm__ volatile(

                "mov %1, %%eax \n"
                "mov $0, %%ebx \n"

                "L1: \n"

                "vmovntdqa (%[src],%%ebx), %%ymm0 \n"
                "vmovntdq  %%ymm0, (%[dst],%%ebx) \n"

                "add %%ebx, $32 \n"

                "cmp %%eax, %%ebx \n"
                "jz L1 \n"

                :[dst]"=r"(out)
                :[src]"r"(in),"m"(sz)
                :"memory"
                );

}

G++ tells me:

Error: unsupported instruction `mov'
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'

So I tried this:

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

     assert((sz%32 == 0));
__asm__ volatile(

            "mov %1, %%eax \n"
            "mov $0, %%ebx \n"

            "L1: \n"

            "vmovntdqa %%ebx(%[src]), %%ymm0 \n"
            "vmovntdq  %%ymm0, (%[dst],%%ebx) \n"

            "add %%ebx, $32 \n"

            "cmp %%eax, %%ebx \n"
            "jz L1 \n"

            :[dst]"=r"(out)
            :[src]"r"(in),"m"(sz)
            :"memory"
                );

}

From G++ I get:

Error: unsupported instruction `mov'
Error: junk `(%rdi)' after register
Error: `(%rdi,%ebx)' is not a valid base/index expression
Error: operand type mismatch for `add'

In each case I tried to find a solution, but without success. I also came across this approach:

void my_memcpy(const std::uint8_t* in,std::uint8_t* out,const std::size_t& sz)
{

    __asm__ volatile (
          ".intel_syntax noprefix;"

          "mov eax, [SZ];"
          "mov ebx, 0;"

          "L1 : "

          "vmovntdqa ymm0, [src+ebx];"
          "vmovntdq [dst+ebx], ymm0;"

          "add ebx, 32 \n"

          "cmp ebx, eax \n"
          "jz L1 \n"
                ".att_syntax;"
          : [dst]"=r"(out)
          : [SZ]"m"(sz),[src]"r"(in)
          : "memory");



}

G++ :

undefined reference to `SZ'
undefined reference to `src'
undefined reference to `dst'

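These messages look common, but I don't know how to fix them in this case.

I also know that my attempts do not exactly match the code I wrote in C++.

I would like to understand what is wrong with my attempts, and how to translate my C++ function as closely as possible.

Thanks in advance.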

Answer 1

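Your first example is the closest to correct, but it has the following errors:

It uses 32-bit registers instead of 64-bit ones.
It modifies three registers (EAX, EBX and YMM0) that are not declared as outputs or clobbers.
EAX is loaded with the source address (operand %1), not the size.
dst is declared as an output, whereas it should be an input.
The operands of the add instruction are the wrong way round; in AT&T syntax the destination register comes last.
It uses a non-local label (L1), which will fail if the asm statement is duplicated, for example by inlining.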

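And the following performance issues:

The sz parameter is passed by reference. (This may also hinder optimization of the calling function.)
It is then passed to the asm as a memory operand, which requires it to be written to memory.
It is then copied into another register.
Fixed registers are used instead of letting the compiler choose.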
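
Several of the points in the two lists above (64-bit index registers, compiler-chosen registers, AT&T operand order, a local numeric label, declared clobbers) can be tried out without AVX hardware. The following is only a minimal sketch and is not taken from your code: the function name copy_qwords is hypothetical, it copies 8 bytes per iteration with a plain 64-bit mov, and it assumes GCC or Clang on x86-64, falling back to std::memcpy elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper for illustration: copies n bytes, n a multiple of 8.
void copy_qwords(const std::uint8_t* src, std::uint8_t* dst, std::size_t n)
{
    assert(n % 8 == 0);
    if (n == 0) return;              // the asm loop below runs at least once
#if defined(__x86_64__) && defined(__GNUC__)
    std::size_t i = 0;               // byte offset, read and written by the asm
    std::uint64_t tmp;               // scratch register picked by the compiler
    __asm__ volatile(
        "1: \n"                              // numeric label: safe if the asm is duplicated
        "mov (%[src],%[i]), %[tmp] \n"       // base and index are both 64-bit registers
        "mov %[tmp], (%[dst],%[i]) \n"
        "add $8, %[i] \n"                    // AT&T order: source first, destination last
        "cmp %[n], %[i] \n"
        "jne 1b \n"                          // loop until i == n
        : [i] "+r"(i), [tmp] "=&r"(tmp)      // early-clobber: tmp is written before the inputs are done
        : [src] "r"(src), [dst] "r"(dst), [n] "r"(n)
        : "memory", "cc");
#else
    std::memcpy(dst, src, n);        // portable fallback on other targets
#endif
}
```

Calling copy_qwords(src, dst, 64) on two 64-byte arrays should leave dst equal to src. The early-clobber modifier on tmp matters here because tmp shares a register class with the input operands; without it the compiler would be free to place tmp in the same register as dst or n.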

Here is a fixed version of your function; it is no faster than the equivalent C++ with intrinsics:

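#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>   // __m256i; compile with -mavx2

void my_memcpy(const std::uint8_t* in, std::uint8_t* out, const std::size_t sz)
{
    std::size_t count = 0;   // byte offset, updated by the asm
    __m256i temp;            // scratch YMM register chosen by the compiler

    // Like a do-while, the loop body runs once before the test,
    // so sz must also be non-zero.
    assert((sz % 32 == 0));

    __asm__ volatile(

                "1: \n"                                   // local numeric label

                "vmovntdqa (%[src],%[count]), %[temp] \n"
                "vmovntdq  %[temp], (%[dst],%[count]) \n"

                "add $32, %[count] \n"                    // AT&T: destination last

                "cmp %[sz], %[count] \n"
                "jne 1b \n"                               // loop until count == sz

                :[count]"+r"(count), [temp]"=x"(temp)
                :[dst]"r"(out), [src]"r"(in), [sz]"r"(sz)
                :"memory", "cc"
                );

}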

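The source and destination parameters are in the opposite order to memcpy, which may cause confusion.

Your Intel syntax version additionally fails to use the correct syntax to refer to the operands (e.g. %[dst]).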
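
For comparison, the intrinsics equivalent mentioned above might look roughly like the sketch below. It is an illustration under assumptions rather than a drop-in replacement: the names my_memcpy_intrin and stream_copy_avx2 are hypothetical, both pointers are assumed to be 32-byte aligned (the streaming load and store require it), the compiler is assumed to be GCC or Clang, and the trailing _mm_sfence and the std::memcpy fallback are additions that do not appear in your code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_X86_GNU 1   // hypothetical guard macro for this sketch
#endif

#ifdef SKETCH_HAVE_X86_GNU
// Built for AVX2 through the target attribute, so no -mavx2 flag is needed.
__attribute__((target("avx2")))
static void stream_copy_avx2(const std::uint8_t* in, std::uint8_t* out, std::size_t sz)
{
    for (std::size_t i = 0; i != sz; i += 32) {
        __m256i tmp = _mm256_stream_load_si256(
            reinterpret_cast<const __m256i*>(in + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(out + i), tmp);
    }
    _mm_sfence();   // non-temporal stores are weakly ordered; fence before returning
}
#endif

// Size is passed by value; source first, destination second, as in the question.
void my_memcpy_intrin(const std::uint8_t* in, std::uint8_t* out, std::size_t sz)
{
    assert(sz % 32 == 0);
#ifdef SKETCH_HAVE_X86_GNU
    if (__builtin_cpu_supports("avx2")) {   // runtime check for AVX2
        stream_copy_avx2(in, out, sz);
        return;
    }
#endif
    std::memcpy(out, in, sz);               // fallback without AVX2
}
```

With two buffers declared as alignas(32) std::uint8_t src[96], dst[96], a call to my_memcpy_intrin(src, dst, 96) should leave dst equal to src. Because the loop tests before the first iteration, this version also handles sz == 0, unlike the asm loop.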

c++

inline-assembly

avx2