Example to support the statement pass by value is not good practice even for small user defined types
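I am reading Scott Meyers' Effective C++, where the author compares pass-by-value with pass-by-reference. He recommends pass-by-reference for user-defined types and pass-by-value for built-in types. I am looking for an example that illustrates the following paragraph, which states that pass-by-value can be expensive even for small user-defined objects.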
Built-in types are small, so some people conclude that all small types
are good candidates for pass-by-value, even if they’re user-defined.
This is shaky reasoning. Just because an object is small doesn’t mean
that calling its copy constructor is inexpensive. Many objects — most
STL containers among them — contain little more than a pointer, but
copying such objects entails copying everything they point to. That
can be very expensive.
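One way to see the quoted point in practice is to count element copies when a container is passed by value. This is only a sketch: `Counted`, `g_copies`, `by_value`, `by_ref` and the helper functions are made-up names for illustration, not anything from the book.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Global counter of copy-constructor calls (illustrative only).
std::size_t g_copies = 0;

// Hypothetical element type that records every copy construction.
struct Counted {
    Counted() = default;
    Counted(const Counted&) { ++g_copies; }
};

// Pass-by-value: the vector's copy constructor copies every element.
std::size_t by_value(std::vector<Counted> v) { return v.size(); }

// Pass-by-reference-to-const: no copy is made.
std::size_t by_ref(const std::vector<Counted>& v) { return v.size(); }

// Number of element copies caused by one by-value call on n elements.
std::size_t copies_by_value(std::size_t n) {
    std::vector<Counted> v(n);  // n default-constructed elements
    const std::size_t before = g_copies;
    by_value(v);
    return g_copies - before;
}

// Number of element copies caused by one by-reference call on n elements.
std::size_t copies_by_ref(std::size_t n) {
    std::vector<Counted> v(n);
    const std::size_t before = g_copies;
    by_ref(v);
    return g_copies - before;
}

void demo_small_but_expensive() {
    assert(copies_by_value(1000) == 1000);  // one copy per element
    assert(copies_by_ref(1000) == 0);       // nothing is copied
}
```

The vector object itself is only a few pointers wide on common implementations, yet the by-value call pays for every element it owns.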
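It depends on whether your copy is a deep copy or a shallow copy (in other words, whether the class is value-like or pointer-like). For example, A is a class with just one pointer to another object: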
struct B {};  // stand-in for some heap-allocated payload
struct A
{
    B* pB;
    ~A() { delete pB; }
};
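If you pass A by value, the implicitly generated copy constructor makes a shallow copy, so pB in the copy a1 points to the same heap memory as the original. That is to say, the dtor ~A() runs for both objects and deletes the same pointer twice, which is undefined behavior. So you need a deep copy: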
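struct A
{
    B* pB;
    const A& operator=(const A& rhs)
    {
        if (this != &rhs) {       // guard against self-assignment
            delete pB;
            pB = new B(*rhs.pB);  // deep copy: clone the pointee
        }
        return *this;
    }
    // the copy/move constructor and move assignment should also be redefined
    ~A() { delete pB; }
};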
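To make the comment about redefining the other special members concrete, here is one possible way to complete the class. It is a sketch, not the answerer's code: the `explicit A(int)` constructor, the `value` member of `B`, the copy-and-swap assignment and `read_by_value` are all assumptions added for illustration.

```cpp
#include <cassert>
#include <utility>

struct B { int value; };  // hypothetical payload

struct A {
    B* pB;

    explicit A(int v) : pB(new B{v}) {}

    // Deep copy: allocate a new B so the two objects never share a pointee.
    A(const A& rhs) : pB(rhs.pB ? new B(*rhs.pB) : nullptr) {}

    // Move: take over the pointer and leave the source empty.
    A(A&& rhs) noexcept : pB(rhs.pB) { rhs.pB = nullptr; }

    // Copy-and-swap handles both copy and move assignment.
    A& operator=(A rhs) noexcept {
        std::swap(pB, rhs.pB);
        return *this;
    }

    ~A() { delete pB; }
};

// Passing by value is now safe, but each call pays for a heap allocation.
int read_by_value(A a) { return a.pB->value; }

void demo_deep_copy() {
    A a(7);
    assert(read_by_value(a) == 7);
    A b(a);
    b.pB->value = 9;            // modifying the copy...
    assert(a.pB->value == 7);   // ...does not affect the original
}
```

This removes the double delete, though it also shows the cost the question asks about: a pointer-sized object whose copy involves a `new` and a `delete`.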
In summary, if your class is trivially copyable, then copying a small user-defined object, or passing it by value, does not cost much; otherwise it depends on the case.
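Whether a class is trivially copyable can be checked at compile time with the standard type traits. A small sketch, where `Point` and `Owner` are hypothetical types chosen for illustration:

```cpp
#include <string>
#include <type_traits>

// Hypothetical small value type: copying it is a plain bitwise copy.
struct Point { int x, y; };

// Hypothetical owning type: a user-provided copy constructor and
// destructor make it non-trivially copyable, whatever its size.
struct Owner {
    int* p;
    Owner() : p(nullptr) {}
    Owner(const Owner& o) : p(o.p ? new int(*o.p) : nullptr) {}
    Owner& operator=(const Owner&) = delete;
    ~Owner() { delete p; }
};

static_assert(std::is_trivially_copyable<Point>::value,
              "Point can be copied like a built-in type");
static_assert(!std::is_trivially_copyable<Owner>::value,
              "Owner runs user code on every copy");
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string manages its own storage");
```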
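If you still want to pass by value and do not want to trigger undefined behavior, shared_ptr may be a good choice for you. But as @Arne Vogel pointed out, the implementation of shared_ptr is thread-safe, which requires atomic operations on the reference count, and that adds to the cost.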
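The reference-count traffic can be observed through `use_count()`. A minimal sketch (the function names and the `B` payload are hypothetical):

```cpp
#include <cassert>
#include <memory>

struct B { int value; };  // hypothetical payload

// By value: copying the shared_ptr increments the reference count on
// entry and decrements it again when the parameter is destroyed.
long count_by_value(std::shared_ptr<B> p) { return p.use_count(); }

// By reference to const: the reference count is not touched.
long count_by_ref(const std::shared_ptr<B>& p) { return p.use_count(); }

void demo_shared_ptr_cost() {
    std::shared_ptr<B> p = std::make_shared<B>();
    assert(count_by_value(p) == 2);  // caller's pointer plus the copy
    assert(count_by_ref(p) == 1);    // only the caller's pointer
    assert(p.use_count() == 1);      // the copy is gone after the call
}
```

Sharing is safe here, but every by-value call still performs an increment and a decrement of the count.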
The 'cost' is simply wasted CPU cycles.
#include <iostream>
class simple {
public:
    simple() { std::cout << "constructor" << std::endl; }
    simple(const simple& copy) { std::cout << "copied" << std::endl; }
    ~simple() { std::cout << "destructor" << std::endl; }
    void addr() const { std::cout << this << std::endl; }
};
void simple_ref(const simple& ref) { ref.addr(); }
void simple_val(simple val) { val.addr(); }
int main(int argc, char* argv[])
{
    simple val;       // output: 'constructor'
    simple_ref(val);  // output: address of val
    simple_val(val);  // output: 'copied', address of copy made, 'destructor' (the destructor of the copy made)
    return 0;
    // output: 'destructor' (destructor of 'val')
}
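There is no member data here, so sizeof(simple) gives 1 on my machine. Even so, calling a function by value rather than by reference invokes a copy, even for something as simple as printing the address of a variable.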
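(This is the content of a blog post about copy vs. reference by Thiago Macieira, https://www.macieira.org/blog/2012/02/the-value-of-passing-by-value/)
Before we go into the ABI documents and try to compile code, we need to define the problem we are trying to solve. In general terms, I am trying to find the best way to pass small C++ structures: when is it better to pass them by value rather than by constant reference? And under those conditions, does the qreal discussion matter at all?
Small structures like QLatin1String, which contain exactly one pointer as a member, will benefit from being passed by value. What other kinds of structures should we look at?
- Structures with more than one pointer
- Structures with 32-bit integers on 64-bit architectures
- Structures with floating-point members (single and double precision)
- Mixed-type and specialised structures found in Qt
I will investigate the x86-64, ARMv7 hard-float, MIPS hard-float (o32) and IA-64 ABIs because they are the ones I have access to compilers for. All of them support passing parameters in registers and have at least 4 integer registers used for parameter passing. All but MIPS also have at least 4 floating-point registers used for parameter passing. For more information, see my earlier blog on ABI details.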
struct Pointers2 {
    void *p1, *p2;
};
struct Pointers4 {
    void *p1, *p2, *p3, *p4;
};
struct Integers2 { // like QSize and QPoint
    int i1, i2;
};
struct Integers4 { // like QRect
    int i1, i2, i3, i4;
};
template <typename F> struct Floats2 { // like QSizeF, QPointF, QVector2D
    F f1, f2;
};
template <typename F> struct Floats3 { // like QVector3D
    F f1, f2, f3;
};
template <typename F> struct Floats4 { // like QRectF, QVector4D
    F f1, f2, f3, f4;
};
template <typename F> struct Matrix4x4 { // like QGenericMatrix<4, 4>
    F m[4][4];
};
struct QChar {
    unsigned short ucs;
};
struct QLatin1String {
    const char *str;
    int len;
};
template <typename F> struct QMatrix {
    F _m11, _m12, _m21, _m22, _dx, _dy;
};
template <typename F> struct QMatrix4x4 { // like QMatrix4x4
    F m[4][4];
    int f;
};
template <typename T> void externalFunction(T);
template <typename T> void passOne()
{
    externalFunction<T>(T());  // body lost in the source; assumed: pass a value-initialized T
}
template <typename T> T externalReturningFunction();
template <typename T> void returnOne()
{
    externalReturningFunction<T>();  // body lost in the source; assumed: call and discard the result
}
// explicit template instantiation (definitions)
template void passOne<Pointers2>();
template void passOne<Pointers4>();
template void passOne<Integers2>();
template void passOne<Integers4>();
template void passOne<Floats2<float> >();
template void passOne<Floats2<double> >();
template void passOne<Floats3<float> >();
template void passOne<Floats3<double> >();
template void passOne<Floats4<float> >();
template void passOne<Floats4<double> >();
template void passOne<Matrix4x4<float> >();
template void passOne<Matrix4x4<double> >();
template void passOne<QChar>();
template void passOne<QLatin1String>();
template void passOne<QMatrix<float> >();
template void passOne<QMatrix<double> >();
template void passOne<QMatrix4x4<float> >();
template void passOne<QMatrix4x4<double> >();
template void returnOne<Pointers2>();
template void returnOne<Pointers4>();
template void returnOne<Integers2>();
template void returnOne<Integers4>();
template void returnOne<Floats2<float> >();
template void returnOne<Floats2<double> >();
template void returnOne<Floats3<float> >();
template void returnOne<Floats3<double> >();
template void returnOne<Floats4<float> >();
template void returnOne<Floats4<double> >();
template void returnOne<Matrix4x4<float> >();
template void returnOne<Matrix4x4<double> >();
template void returnOne<QChar>();
template void returnOne<QLatin1String>();
template void returnOne<QMatrix<float> >();
template void returnOne<QMatrix<double> >();
template void returnOne<QMatrix4x4<float> >();
template void returnOne<QMatrix4x4<double> >();
In addition, we are interested in what happens to non-structure floating-point parameters: are they promoted or not? So we will also test the following:
void passFloat()
{
    void externalFloat(float, float, float, float);
    externalFloat(1.0f, 2.0f, 3.0f, 4.0f);
}
void passDouble()
{
    void externalDouble(double, double, double, double);
    externalDouble(1.0f, 2.0f, 3.0f, 4.0f);
}
float returnFloat()
{
    return 1.0f;
}
double returnDouble()
{
    return 1.0;
}
Analysis of the output
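You may have noticed that I skipped old-style 32-bit x86. That was intentional, since that platform does not support passing by registers anyway. The only conclusions we could draw from it would be:
- whether the structures are stored in the stack in the place of the argument, or whether they're stored elsewhere and passed by pointer;
- whether single-precision floating-point is promoted to double-precision.
Moreover, I intentionally ignored it because I want people to start thinking about the new ILP32 ABI for x86-64, enabled by GCC 4.7's -mx32 switch, which follows the same ABI as described below (except that pointers are 32-bit).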
x86-64
- Pointers2 is passed in registers;
- Pointers4 is passed in memory;
- Integers2 is passed in a single register (two 32-bit values per 64-bit register);
- Integers4 is passed in two registers only (two 32-bit values per 64-bit register);
- Floats2<float> is passed packed into a single SSE register, no promotion to double;
- Floats3<float> is passed packed into two SSE registers, no promotion to double;
- Floats4<float> is passed packed into two SSE registers, no promotion to double;
- Floats2<double> is passed in two SSE registers, one value per register;
- Floats3<double> and Floats4<double> are passed in memory;
- Matrix4x4 and QMatrix4x4 are passed in memory regardless of the underlying type;
- QChar is passed in a register;
- QLatin1String is passed in registers;
- The floating-point parameters are passed one per register, without float promotion to double.
For the return values, the conclusion is the same as above: if a value is passed in registers, it is returned in registers too; if it is passed in memory, it is returned in memory. From a careful reading of the ABI document, we reach the following conclusions:
- Single-precision floating-point types are not promoted to double;
- Single-precision floating-point types in a structure are packed into SSE registers if they are still available;
- Structures bigger than 16 bytes are passed in memory, with an exception for __m256, the type corresponding to one AVX 256-bit register.
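The 16-byte limit can be checked against the test structures with a small compile-time helper. This is a sketch under assumptions: `fits_in_two_registers` and `kRegisterLimit` are names invented here, and the size test is only a necessary condition (the ABI also looks at the member types and at whether the type is trivially copyable). The expected sizes assume 4-byte `int` and `float` and 8-byte `double`.

```cpp
#include <cassert>
#include <cstddef>

struct Pointers2 { void *p1, *p2; };
struct Pointers4 { void *p1, *p2, *p3, *p4; };
struct Integers4 { int i1, i2, i3, i4; };
template <typename F> struct Floats3 { F f1, f2, f3; };
template <typename F> struct Floats4 { F f1, f2, f3, f4; };

// Assumed limit taken from the text: two 8-byte registers.
constexpr std::size_t kRegisterLimit = 16;

// Hypothetical helper: true when T is small enough to be a candidate
// for register passing on x86-64.
template <typename T>
constexpr bool fits_in_two_registers() {
    return sizeof(T) <= kRegisterLimit;
}

void demo_size_limit() {
    assert(fits_in_two_registers<Pointers2>());         // 16 bytes on LP64
    assert(fits_in_two_registers<Integers4>());         // 16 bytes
    assert(fits_in_two_registers<Floats4<float>>());    // 16 bytes
    assert(!fits_in_two_registers<Floats3<double>>());  // 24 bytes
}
```

On a 64-bit target `Pointers4` is 32 bytes and fails the check, which matches the observation that it is passed in memory.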
IA-64
- Both Pointers structures are passed in registers, one pointer per register;
- Both Integers structures are passed in registers, packed like x86-64 (two ints per register);
- All of the Floats structures are passed in registers, one value per register (unpacked);
- Matrix4x4<float> is passed entirely in registers: half of it (the first 8 floats) is in floating-point registers, one value per register (unpacked); the other half is passed in integer registers out4 to out7 as the memory representation (packed);
- Matrix4x4<double> is passed partly in registers: half of it (the first 8 doubles) is in floating-point registers, one value per register (unpacked); the other half is passed in memory;
- QChar and QLatin1String are passed in registers;
- Both QMatrix variants are passed entirely in registers, one value per register (unpacked);
- QMatrix4x4 is passed like Matrix4x4, except that the integer is always in memory (the structure is larger than 8*8 bytes);
- Individual floating-point parameters are passed one per register; type promotion happens internally in the register.
For the return values, we have:
- The floating-point structures with up to 8 floating-point members are returned in registers;
- The integer structures of up to 32 bytes are returned in registers;
- All the rest is returned in memory supplied by the caller.
Conclusions from the ABI:
- Type promotion happens in hardware, as IA-64 does not have specific registers for single or double precision (its FP registers hold only extended-precision data);
- Homogeneous structures of floating-point types are passed in registers, up to 8 values; the rest goes to the integer registers if some are still available, or to memory;
- All other structures are passed in the integer registers, up to 64 bytes;
- Integer registers are allocated for passing any and all types, even if they aren't used (the ABI says they should be used in the case of C without prototypes).
ARM
I only compiled the code for ARMv7, with floating-point parameters passed in VFP registers. If you are reading this blog, you are probably interested in performance, so you must be using the "hard-float" model of ARM. I will not concern myself with the slower "soft-float" mode. Also note that this applies to ARMv7 only: the rules for 64-bit ARMv8 (AArch64) are slightly different, but no compiler was available.
- Pointers2, Pointers4, Integers2, and Integers4 are passed in registers (note that the Pointers and Integers structures are the same in 32-bit mode);
- All of the Floats types are passed in registers, one value per register, without promotion of floats to doubles; the values are also stored in memory but I can't tell if this is required or just GCC being dumb;
- All types of Matrix4x4, QMatrix and QMatrix4x4 are passed in both memory and registers, with the registers containing the first 16 bytes;
- QChar and QLatin1String are passed in registers;
- [...] are passed in memory regardless of the underlying type;
- The floating-point parameters are passed one per register, without float promotion to double.
For returning those types, we have:
- All of the Floats types are returned in registers and GCC then stores them all to memory even if they are never used afterwards;
- QChar is returned in a register;
- Everything else is returned in memory.
Note that return types are one place where the 32-bit AAPCS differs from the 64-bit one: there, if a type is passed in registers to a function as its first parameter, it is returned in those same registers. The 32-bit AAPCS restricts return-in-registers to structures of 4 bytes or less.
Conclusions from the ABI:
- Single-precision floating-point types are not promoted to double;
- Homogeneous structures (that is, structures containing one single type) of a floating-point type are passed in floating-point registers if the structure has 4 members or fewer.
MIPS
I tried both a MIPS 32-bit build (using the GCC-default o32 ABI) and a MIPS 64-bit build (using -mabi=o64 -mlong64). Unless noted otherwise, the results are the same for both architectures.
- Both types of Integers and Pointers structures are passed in registers; on 64-bit, two 32-bit integers are packed into a single 64-bit register like x86-64;
- Floats2<float>, Floats3<float>, and Floats4<float> are passed in integer registers, not in the floating-point registers; on 64-bit, two floats are packed into a single 64-bit register;
- Floats2<double> is passed in integer registers; on 32-bit, two 32-bit registers are required to store each double;
- On 32-bit, the first two doubles of Floats3<double> and Floats4<double> are passed in integer registers, the rest are passed in memory;
- On 64-bit, Floats3<double> and Floats4<double> are passed entirely in integer registers;
- Matrix4x4, QMatrix, and QMatrix4x4 are passed in integer registers (the portion that fits) and in memory (the rest);
- QChar is passed in a register (on MIPS big-endian, it's passed on bits 16-31);
- QLatin1String is passed in two registers;
- The floating-point parameters are passed one per register, without float promotion to double.
For the return values, MIPS is easy: everything is returned in memory, even QChar.
Conclusions from the ABI:
- No float is promoted to double;
- No structure is ever passed in floating-point registers;
- No structure is ever returned in registers.
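General conclusion
There are few overall conclusions we can draw. One of them is that single-precision floating-point values are not explicitly promoted to double-precision when they appear as formal parameters. The automatic promotion probably only happens for floating-point values passed through an ellipsis (...), but our problem statement was about calling functions whose parameters are known. The only ABI that deviates slightly from the rule is IA-64, but that is inconsequential since the hardware, like x87, operates in one mode only.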
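At the language level, the promotion through an ellipsis is easy to observe: a `float` argument passed via `...` has to be read back as `double`. A short sketch (the function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdarg>

// Sums `count` floating-point arguments passed through an ellipsis.
// float arguments undergo default argument promotion, so they must be
// read with va_arg(..., double).
double sum_variadic(int count, ...) {
    va_list args;
    va_start(args, count);
    double total = 0.0;
    for (int i = 0; i < count; ++i)
        total += va_arg(args, double);
    va_end(args);
    return total;
}

// With a known parameter list, the values stay float.
float sum_fixed(float a, float b) { return a + b; }

void demo_promotion() {
    float a = 1.5f, b = 2.25f;  // exactly representable values
    assert(sum_variadic(2, a, b) == 3.75);
    assert(sum_fixed(a, b) == 3.75f);
}
```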
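For structures containing integer members (including pointers), there is nothing further to optimise: they are loaded into registers exactly as they appear in memory. That means the portion of a register corresponding to padding may contain uninitialised or garbage data, or something really strange may happen, as with MIPS in big-endian mode. It also means, on all architectures, that types smaller than a register do not occupy an entire register, so they may be packed together with other members.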
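Padding of this kind can be measured with `offsetof` and `sizeof`. The following is a sketch with hypothetical types: `Padded` leaves its padding bytes unnamed, while `ExplicitPad` names them so they can be zeroed like any other member.

```cpp
#include <cassert>
#include <cstddef>

// The compiler normally inserts padding after `tag` so that `value`
// is suitably aligned.
struct Padded {
    char tag;
    int value;
};

// Bytes inserted between `tag` and `value`.
constexpr std::size_t padding_before_value() {
    return offsetof(Padded, value) - sizeof(char);
}

// Hypothetical variant that spells the padding out as a member
// (assumes int is aligned to its own size).
struct ExplicitPad {
    char tag;
    char pad[sizeof(int) - 1];
    int value;
};

// Value-initialization zeroes every named member, including `pad`.
ExplicitPad make_zeroed(char tag, int value) {
    ExplicitPad e{};
    e.tag = tag;
    e.value = value;
    return e;
}

static_assert(sizeof(Padded) >= sizeof(char) + sizeof(int),
              "a struct is at least as large as its members");
```

On typical ABIs with 4-byte `int`, `padding_before_value()` is 3; those three bytes of `Padded` are the part of the register whose contents are not defined.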
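To continue drawing conclusions, we need to exclude MIPS, since it passes everything in integer registers and returns everything via memory. If we do that, we can see that all the ABIs provide an optimisation for structures containing only one floating-point type. The ABI documents give them slightly different names, all meaning a homogeneous floating-point structure. These optimisations mean that the structure is passed in floating-point registers under certain conditions.
The first to hit its limit is actually x86-64: the upper bound is 16 bytes, limited to two SSE registers. The rationale for this seems to be passing one double-precision complex value, which takes 16 bytes. That we are able to pass four single-precision values is an unexpected benefit.
The remaining architectures (ARM and IA-64) can pass more values by register, and always one value per register (no packing). IA-64 has more registers dedicated to parameter passing, so it can pass more than ARM.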
Recommendations for code
- Structures of up to 16 bytes containing integers and pointers should be passed by value;
- Homogeneous structures of up to 16 bytes containing floating-point should be passed by value (2 doubles or 4 floats);
- Mixed-type structures should be avoided; if they exist, passing by value is still a good idea.
The above is only valid for structures that are trivially copyable and trivially destructible. All C structures (PODs in C++) meet these criteria.
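These recommendations can be encoded as a type trait that picks the parameter type automatically. This is one possible sketch, not something from the blog post: `is_cheap_to_pass_by_value` and `param_t` are invented names, and the 16-byte threshold is simply the figure quoted above.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical trait: small, trivially copyable, trivially destructible.
template <typename T>
struct is_cheap_to_pass_by_value
    : std::integral_constant<bool,
          std::is_trivially_copyable<T>::value &&
          std::is_trivially_destructible<T>::value &&
          sizeof(T) <= 16> {};

// T for cheap types, const T& for everything else.
template <typename T>
using param_t = typename std::conditional<
    is_cheap_to_pass_by_value<T>::value, T, const T&>::type;

struct Integers2 { int i1, i2; };
struct Matrix4x4 { double m[4][4]; };

// param_t<T> is a non-deduced context, so it suits functions whose
// parameter type is spelled out, as here.
int sum(param_t<Integers2> v) { return v.i1 + v.i2; }
double trace(param_t<Matrix4x4> m) {
    return m.m[0][0] + m.m[1][1] + m.m[2][2] + m.m[3][3];
}

static_assert(std::is_same<param_t<Integers2>, Integers2>::value,
              "small trivial struct: by value");
static_assert(std::is_same<param_t<Matrix4x4>, const Matrix4x4&>::value,
              "128-byte struct: by const reference");
static_assert(std::is_same<param_t<std::string>, const std::string&>::value,
              "non-trivial type: by const reference");
```

The design mirrors the idea behind utilities such as Boost's call_traits: the caller writes one signature and the trait decides between a copy and a reference.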
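Final note
I should note that the recommendations above do not always produce more efficient code. Even though the values can be passed in registers, every compiler I tested (GCC 4.6, Clang 3.0, ICC 12.1) still performs a lot of memory operations in some circumstances. It is quite common for the compiler to write the structure to memory and then load it into the registers. When it does that, passing by constant reference would be more efficient, since it would replace the memory loads with arithmetic on the stack pointer.
However, those are just a matter of further optimisation work by the compiler teams. The three compilers I tested for x86-64 optimise differently and, in almost all cases, at least one of them managed to do without memory accesses. Interestingly, the behaviour also changes when we replace the padding space with zeroes.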