Is returning a 2-tuple less efficient than std::pair?

Question

Consider this code:

#include <utility>
#include <tuple>

std::pair<int, int> f1()
{
    return std::make_pair(0x111, 0x222);
}

std::tuple<int, int> f2()
{
    return std::make_tuple(0x111, 0x222);
}

Clang 3 and 4 generate similar code for both on x86-64:

f1():
 movabs rax,0x22200000111
 ret    
f2():
 movabs rax,0x11100000222 ; opposite packing order, not important
 ret
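
The two constants differ only in which half of rax holds which value. A minimal sketch of that arithmetic, assuming a little-endian target where the member at the lower address ends up in the low 32 bits of the register; pack is a hypothetical helper for illustration, not part of any library:

```cpp
#include <cstdint>

// Hypothetical helper: build the 64-bit register image of two adjacent
// 32-bit members. `low` is the member at offset 0, `high` the one at offset 4.
constexpr std::uint64_t pack(std::uint32_t low, std::uint32_t high) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// f1: 0x111 at offset 0, 0x222 at offset 4.
static_assert(pack(0x111, 0x222) == 0x22200000111ULL, "image returned by f1");
// f2: the same two values, stored in the opposite order.
static_assert(pack(0x222, 0x111) == 0x11100000222ULL, "image returned by f2");
```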

But Clang 5 generates different code for f2():

f2():
 movabs rax,0x11100000222
 mov    QWORD PTR [rdi],rax
 mov    rax,rdi
 ret

As do GCC 4 through GCC 7:

f2():
 movabs rdx,0x11100000222
 mov    rax,rdi
 mov    QWORD PTR [rdi],rdx ; GCC 4-6 use 2 DWORD stores
 ret

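Why is the generated code worse when returning a std::tuple that fits in a single register than when returning a std::pair? It seems especially odd since Clang 3 and 4 appear to be optimal, yet 5 is not.

Try it here: https://godbolt.org/g/T2Yqrj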

Answer 1

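The short answer is that the libstdc++ standard library implementation used by gcc and clang on Linux implements std::tuple with a non-trivial move constructor (in particular, the _Tuple_impl base class has a non-trivial move constructor). The copy and move constructors of std::pair, on the other hand, are all defaulted.

This in turn causes a C++ ABI-related difference in the calling convention used for returning these objects from functions, as well as for passing them by value.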
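
The distinction can be reproduced without the standard library. The following is a minimal sketch using hypothetical stand-in types (PairLike, TupleImplLike, and TupleLike are invented for illustration and are not libstdc++'s actual definitions). It shows that a user-provided move constructor in a base class makes the derived type non-trivially move constructible, even when the derived type defaults its own:

```cpp
#include <type_traits>

// Hypothetical stand-in for std::pair<int, int>: copy and move are defaulted.
struct PairLike {
    int first;
    int second;
    PairLike(const PairLike&) = default;
    PairLike(PairLike&&) = default;
};

// Hypothetical stand-in for a tuple implementation base class with a
// user-provided (and therefore non-trivial) move constructor.
struct TupleImplLike {
    int head;
    int tail;
    TupleImplLike(const TupleImplLike&) = default;
    TupleImplLike(TupleImplLike&& other) noexcept
        : head(other.head), tail(other.tail) {}
};

// The derived type defaults its move constructor, but that constructor must
// call the base's user-provided one, so it is non-trivial as well.
struct TupleLike : TupleImplLike {
    TupleLike(const TupleLike&) = default;
    TupleLike(TupleLike&&) = default;
};

static_assert(std::is_trivially_move_constructible<PairLike>::value,
              "defaulted move constructor is trivial");
static_assert(!std::is_trivially_move_constructible<TupleLike>::value,
              "non-trivial base move constructor propagates");
static_assert(sizeof(PairLike) == sizeof(TupleLike),
              "both types have the same size");
```

Both types have the same size and hold the same data; only the triviality of the move constructor differs, and that is the property the calling convention looks at.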

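The Gory Details

You ran your test on Linux, which adheres to the SysV x86-64 ABI. This ABI has specific rules for passing classes or structures to functions and returning them, which the ABI specification describes in detail. The specific case we are interested in is whether the two int fields in these structures get the INTEGER class or the MEMORY class.

A recent version of the ABI specification has this to say: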

The classification of aggregate (structures and arrays) and union types works as follows:

1. If the size of an object is larger than eight eightbytes, or it contains unaligned fields, it has class MEMORY.

2. If a C++ object has either a non-trivial copy constructor or a non-trivial destructor, it is passed by invisible reference (the object is replaced in the parameter list by a pointer that has class INTEGER).

3. If the size of the aggregate exceeds a single eightbyte, each is classified separately. Each eightbyte gets initialized to class NO_CLASS.

4. Each field of an object is classified recursively so that always two fields are considered. The resulting class is calculated according to the classes of the fields in the eightbyte

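It is condition (2) that applies here. Note that it mentions only copy constructors, not move constructors. This is fairly clearly a defect in the specification, though, since once move constructors were introduced they generally needed to be included in any classification algorithm that previously covered copy constructors. In particular, the IA-64 cxx-abi, which gcc is documented to follow, does include move constructors: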

If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference. Specifically:

Space is allocated by the caller in the usual manner for a temporary, typically on the stack.

Followed by the definition of non-trivial:

A type is considered non-trivial for the purposes of calls if:

it has a non-trivial copy constructor, move constructor, or destructor, or

all of its copy and move constructors are deleted.
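
The quoted definition can be approximated with standard type traits. This is only a rough sketch: non_trivial_for_calls is a hypothetical name, and the "all copy and move constructors are deleted" clause is approximated as "neither copy nor move constructible", which is not exactly what the ABI says. The example types are likewise invented for illustration.

```cpp
#include <type_traits>

// Rough approximation of "non-trivial for the purposes of calls".
template <typename T>
struct non_trivial_for_calls
    : std::integral_constant<bool,
          // approximation of: all copy and move constructors are deleted
          (!std::is_copy_constructible<T>::value &&
           !std::is_move_constructible<T>::value) ||
          // a usable copy constructor that is non-trivial
          (std::is_copy_constructible<T>::value &&
           !std::is_trivially_copy_constructible<T>::value) ||
          // a usable move constructor that is non-trivial
          (std::is_move_constructible<T>::value &&
           !std::is_trivially_move_constructible<T>::value) ||
          // a non-trivial destructor
          !std::is_trivially_destructible<T>::value> {};

struct Plain { int a; int b; };                      // everything implicit
struct UserMove {                                    // user-provided move
    int a; int b;
    UserMove(UserMove&& o) noexcept : a(o.a), b(o.b) {}
};
struct UserDtor { int a; ~UserDtor() {} };           // user-provided destructor
struct NoCopyNoMove {                                // both deleted
    NoCopyNoMove(const NoCopyNoMove&) = delete;
    NoCopyNoMove(NoCopyNoMove&&) = delete;
};

static_assert(!non_trivial_for_calls<Plain>::value, "eligible for registers");
static_assert(non_trivial_for_calls<UserMove>::value, "non-trivial move");
static_assert(non_trivial_for_calls<UserDtor>::value, "non-trivial destructor");
static_assert(non_trivial_for_calls<NoCopyNoMove>::value, "nothing usable");
```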

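So, because tuple is not considered trivially copyable from the ABI's perspective, it gets MEMORY treatment, which means your function has to populate the stack-allocated object whose address the caller passes in rdi. The std::pair function can simply pass back the entire structure in rax, since it fits in one EIGHTBYTE and has class INTEGER.

Does it matter? Yes, strictly speaking: a standalone function like the one you compiled will be less efficient for tuple, since this ABI difference is "baked in".

Often, however, the compiler can see the body of the function and inline it, or perform inter-procedural analysis even without inlining. In both cases the ABI no longer matters, and the two approaches are likely to be equally efficient, at least with a decent optimizer. For example, let's call your f1() and f2() functions and do some math on the result: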
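
Written out as ordinary C++, the two return conventions look roughly like the sketch below. This only illustrates the shape of the calls and assumes C++14 or later; TwoInts, return_in_register, return_via_hidden_pointer, and call_and_sum are hypothetical names, and a real compiler performs this lowering internally rather than in source code.

```cpp
#include <cstdint>

// Hypothetical two-int struct standing in for the returned object.
struct TwoInts {
    int a;
    int b;
};

// INTEGER class: the callee hands the whole 8-byte value back in rax.
constexpr std::uint64_t return_in_register() {
    return 0x22200000111ULL;
}

// MEMORY class: the caller passes the address of a temporary (in rdi); the
// callee fills it in and returns that same address (in rax).
constexpr TwoInts* return_via_hidden_pointer(TwoInts* result) {
    result->a = 0x111;
    result->b = 0x222;
    return result;
}

// What a caller of the MEMORY-class function effectively has to do.
constexpr int call_and_sum() {
    TwoInts temp{0, 0};  // caller-allocated temporary, typically on the stack
    TwoInts* r = return_via_hidden_pointer(&temp);
    return r->a + r->b;
}

static_assert(call_and_sum() == 0x333, "0x111 + 0x222");
```

The extra store and the returned pointer in return_via_hidden_pointer correspond to the two additional mov instructions in the f2() listings.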

int add_pair() {
  auto p = f1();
  return p.first + p.second;
}

int add_tuple() {
  auto t = f2();
  return std::get<0>(t) + std::get<1>(t);
}

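In principle the add_tuple function starts at a disadvantage, since it has to call f2(), which is less efficient, and it also has to create a temporary tuple object on the stack so that it can pass it to f2 as the hidden parameter. No matter: both functions are fully optimized to just return the right value directly: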

add_pair():
  mov eax, 819
  ret
add_tuple():
  mov eax, 819
  ret
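
The constant 819 is simply 0x111 + 0x222 = 0x333. As a quick sanity check, the same sums can be evaluated at compile time. This sketch assumes C++14 or later, where the std::pair and std::tuple constructors and std::get are constexpr; sum_pair and sum_tuple are hypothetical names, and constexpr evaluation is a language feature rather than the optimizer pass that produced the listing above.

```cpp
#include <tuple>
#include <utility>

// Same computation as add_pair(), evaluated by the compiler front end.
constexpr int sum_pair() {
    std::pair<int, int> p(0x111, 0x222);
    return p.first + p.second;
}

// Same computation as add_tuple().
constexpr int sum_tuple() {
    std::tuple<int, int> t(0x111, 0x222);
    return std::get<0>(t) + std::get<1>(t);
}

static_assert(sum_pair() == 819, "0x111 + 0x222 == 0x333 == 819");
static_assert(sum_tuple() == 819, "same result through std::tuple");
```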

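So overall, the effect of this ABI issue with tuple is relatively muted: it adds a small fixed overhead to functions that must comply with the ABI, but this only really matters in a relative sense for very small functions, and such functions are likely to be declared somewhere they can be inlined (or if not, you are leaving performance on the table).

libstdc++ vs libc++

As explained above, this is an ABI issue, not an optimization issue per se. Both clang and gcc already optimize the library code as far as the ABI's constraints allow: if they generated code like f1() for the std::tuple case, they would break ABI-compliant callers.

You can see this clearly if you switch to libc++ rather than libstdc++, the Linux default: libc++ doesn't have the explicit move constructor (as Marc Glisse mentions in the comments, libstdc++ is stuck with its implementation for backwards compatibility). Now clang (and presumably gcc, although I didn't try it) generates the same optimal code in both cases: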

f1():                                 # @f1()
        movabs  rax, 2345052143889
        ret
f2():                                 # @f2()
        movabs  rax, 2345052143889
        ret
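
This listing prints the immediate in decimal. A small sketch confirming that it is the same value as f1()'s hexadecimal constant from the question, with 0x111 in the low half and 0x222 in the high half (low_half and high_half are hypothetical helpers):

```cpp
#include <cstdint>

// Hypothetical helpers: split a 64-bit register image into its 32-bit halves.
constexpr std::uint32_t low_half(std::uint64_t v) {
    return static_cast<std::uint32_t>(v);
}
constexpr std::uint32_t high_half(std::uint64_t v) {
    return static_cast<std::uint32_t>(v >> 32);
}

static_assert(2345052143889ULL == 0x22200000111ULL, "decimal equals hex form");
static_assert(low_half(2345052143889ULL) == 0x111, "value at offset 0");
static_assert(high_half(2345052143889ULL) == 0x222, "value at offset 4");
```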

Earlier Versions of Clang

Why do different versions of clang compile it differently? It was simply a bug in clang or a bug in the spec, depending on how you look at it. The spec didn't explicitly include move construction among the cases where a hidden pointer to a temporary needed to be passed, so older clang wasn't conforming to the IA-64 C++ ABI. For example, code compiled the way clang used to do it was not compatible with gcc or with newer versions of clang. The spec was eventually updated, and the clang behavior changed in version 5.0.

Update: Marc Glisse mentions in the comments that there was initially confusion about the interaction of non-trivial move constructors and the C++ ABI, and that clang changed its behavior at some point, which probably explains the switch:

The ABI specification for some argument passing cases involving move constructors were unclear, and when they were clarified, clang changed to follow the ABI. This is probably one of those cases.
