Example to support the statement that pass-by-value is not good practice even for small user-defined types
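I am reading Scott Meyers' Effective C++, where the author compares pass-by-value and pass-by-reference. He recommends pass-by-reference for user-defined types and pass-by-value for built-in types. I am looking for an example that illustrates the following paragraph, which states that pass-by-value can be expensive even for small user-defined objects.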
Built-in types are small, so some people conclude that all small types
are good candidates for pass-by-value, even if they’re user-defined.
This is shaky reasoning. Just because an object is small doesn’t mean
that calling its copy constructor is inexpensive. Many objects — most
STL containers among them — contain little more than a pointer, but
copying such objects entails copying everything they point to. That
can be very expensive.
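It depends on whether your copy is a deep copy or a shallow copy (in other words, whether the class is value-like or pointer-like). For example, let A be a class that holds just one pointer to another object: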
struct B {};   // stand-in definition; the real B may be arbitrarily large
struct A
{
    B* pB;
    ~A() { delete pB; }
} a1, a2;
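If you copy A by value, as in a1 = a2, the default memberwise copy assignment is called, which costs very little. However, pB in a1 and in a2 then point to the same heap memory, so the destructor ~A() deletes that memory twice, which is undefined behavior.
So we have to do something like this: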
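struct A
{
    B* pB;
    A& operator=(const A& rhs)
    {
        if (this != &rhs)
        {
            delete pB;
            pB = new B;        // allocate a fresh B owned by this object
            *pB = *rhs.pB;     // deep copy: invokes B's copy assignment
        }
        return *this;
    }
    // the copy/move constructors and the move assignment should also be redefined
    ~A() { delete pB; }
} a1, a2;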
The snippet above calls B's copy assignment, which can be very expensive.
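To make that cost concrete, here is a minimal sketch that counts element copies when a container is passed by value versus by const reference. The names Counted, g_copies, by_value, by_ref and copies_for_call are made up for this illustration; they do not come from the book or from the answers here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter for this sketch: number of element copies made so far.
std::size_t g_copies = 0;

// Hypothetical element type that records every copy of itself.
struct Counted {
    Counted() = default;
    Counted(const Counted&) { ++g_copies; }
    Counted& operator=(const Counted&) { ++g_copies; return *this; }
};

// The vector object itself is typically only a few pointers wide, but
// passing it by value deep-copies every element it owns.
std::size_t by_value(std::vector<Counted> v) { return v.size(); }

// Passing by const reference copies nothing.
std::size_t by_ref(const std::vector<Counted>& v) { return v.size(); }

// Returns how many element copies a single call costs for n elements.
std::size_t copies_for_call(std::size_t n, bool pass_by_value) {
    std::vector<Counted> v(n);   // n default-constructed elements, no copies
    g_copies = 0;
    if (pass_by_value) by_value(v); else by_ref(v);
    return g_copies;
}

void copy_count_demo() {
    assert(copies_for_call(1000, true) == 1000);  // one copy per element
    assert(copies_for_call(1000, false) == 0);
}
```

Calling copy_count_demo() from main exercises both paths. The cost of the by-value call grows with the number of elements, not with sizeof of the container object.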
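In summary, if your class is trivially copyable, then copying a small user-defined class, or passing it by value, costs little; otherwise it depends.
If you still want to pass by value without triggering undefined behavior, shared_ptr may be a good choice for you. But as @Arne Vogel pointed out, shared_ptr implementations are thread-safe, which requires atomic operations on the reference count, and that adds cost.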
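A small sketch of the shared_ptr variant, assuming a made-up payload for B and hypothetical helper names (owners_by_value, owners_by_ref). It only shows that passing by value copies the handle and bumps the reference count rather than copying B.

```cpp
#include <cassert>
#include <memory>

// Hypothetical payload: large enough that a deep copy would be noticeable.
struct B { int payload[256]; };

// A shares ownership of B instead of deep-copying it.
struct A { std::shared_ptr<B> pB; };

// By value: the shared_ptr is copied, which increments the reference count
// (atomically in typical implementations), but B itself is not copied.
long owners_by_value(A a) { return a.pB.use_count(); }

// By const reference: no copy at all, so the count is unchanged.
long owners_by_ref(const A& a) { return a.pB.use_count(); }

void shared_demo() {
    A a{std::make_shared<B>()};
    assert(owners_by_value(a) == 2);  // caller's handle plus the parameter
    assert(owners_by_ref(a) == 1);
    assert(a.pB.use_count() == 1);    // parameter destroyed, no double delete
}
```

Note the change in semantics: every copy of A now refers to the same B, so A becomes pointer-like rather than value-like.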
The 'cost' is simply wasted CPU cycles.
Here is a simple example:
#include <iostream>
class simple {
public:
simple() { std::cout << "constructor" << std::endl; }
simple(const simple& copy) { std::cout << "copied" << std::endl; }
~simple() { std::cout << "destructor" << std::endl; }
void addr() const { std::cout << &(*this) << std::endl; }
};
void simple_ref(const simple& ref) { ref.addr(); }
void simple_val(simple val) { val.addr(); }
int main(int argc, char* argv[])
{
simple val; // output: 'constructor'
simple_ref(val); // output: address of val
simple_val(val); // output: 'copied', address of copy made, 'destructor' (the destructor of the copy made)
return 0;
// output: 'destructor' (destructor of 'val')
}
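There is no member data here, so sizeof(simple) is 1 on my machine. Even so, calling the function by value rather than by reference makes a copy, even for something as simple as printing the variable's address.
This is a design consideration, since a copy may be exactly what you want, but copying like this has a cost and may be completely unnecessary, especially in the example above.
Hope this helps.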
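OK, copying data can be expensive and unnecessary.
On the other hand, a function that works through a reference is not automatically thread-safe: another thread can modify the referenced object while the function is running. Unless the operations are atomic, it is sometimes better practice to copy the variable so that concurrent threads cannot mutate it underneath you.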
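The difference can be sketched without real threads. In the example below, the callback interfere is a single-threaded stand-in for another thread changing the original; the function names sum_snapshot and sum_live are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Sums a private copy: later changes to the caller's vector are invisible.
int sum_snapshot(std::vector<int> v, const std::function<void()>& interfere) {
    interfere();  // stands in for another thread mutating the original
    return std::accumulate(v.begin(), v.end(), 0);
}

// Sums through a reference: changes made in the meantime are observed.
int sum_live(const std::vector<int>& v, const std::function<void()>& interfere) {
    interfere();
    return std::accumulate(v.begin(), v.end(), 0);
}

void snapshot_demo() {
    std::vector<int> data{1, 2, 3};
    auto bump = [&data] { data[0] = 100; };
    assert(sum_snapshot(data, bump) == 6);   // copy was taken before the change
    data[0] = 1;
    assert(sum_live(data, bump) == 105);     // sees 100 + 2 + 3
}
```

With real threads, an unsynchronized write during the by-reference read would be a data race, and the copy itself still has to be taken while nobody is writing.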
(This is the content of a blog post about copy vs. ref by Thiago Macieira, https://www.macieira.org/blog/2012/02/the-value-of-passing-by-value/)
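Problem statement
Before we dive into the ABI documents and try to compile code, we need to define the problem we are trying to solve. In general terms, I am trying to find the best way to pass small C++ structures: when is it better to pass by value than by constant reference? And under those conditions, does the qreal discussion have any significance?
Small structures like QLatin1String, which contain only one pointer as a member, would benefit from being passed by value. What other kinds of structures should we look at?
- Structures with more than one pointer
- Structures with 32-bit integers on 64-bit architectures
- Structures with floating-point members (single and double precision)
- Mixed-type and special structures found in Qt
I will investigate the x86-64, ARMv7 hard-float, MIPS hard-float (o32) and IA-64 ABIs, because they are the ones I have access to compilers for. All of them support passing parameters in registers and have at least 4 integer registers used for parameter passing. Except for MIPS, all of them also have at least 4 floating-point registers used for parameter passing. For more information, see my previous blog on ABI details.
So we will investigate what happens when you pass the following structures by value: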
struct Pointers2
{
void *p1, *p2;
};
struct Pointers4
{
void *p1, *p2, *p3, *p4;
};
struct Integers2 // like QSize and QPoint
{
int i1, i2;
};
struct Integers4 // like QRect
{
int i1, i2, i3, i4;
};
template <typename F> struct Floats2 // like QSizeF, QPointF, QVector2D
{
F f1, f2;
};
template <typename F> struct Floats3 // like QVector3D
{
F f1, f2, f3;
};
template <typename F> struct Floats4 // like QRectF, QVector4D
{
F f1, f2, f3, f4;
};
template <typename F> struct Matrix4x4 // like QGenericMatrix<4, 4>
{
F m[4][4];
};
struct QChar
{
unsigned short ucs;
};
struct QLatin1String
{
const char *str;
int len;
};
template <typename F> struct QMatrix
{
F _m11, _m12, _m21, _m22, _dx, _dy;
};
template <typename F> struct QMatrix4x4 // like QMatrix4x4
{
F m[4][4];
int f;
};
We will then analyse the assembly of the following program:
template <typename T> void externalFunction(T);
template <typename T> void passOne()
{
externalFunction(T());
}
template <typename T> T externalReturningFunction();
template <typename T> void returnOne()
{
externalReturningFunction<T>();
}
// C++11 explicit template instantiation
template void passOne<Pointers2>();
template void passOne<Pointers4>();
template void passOne<Integers2>();
template void passOne<Integers4>();
template void passOne<Floats2<float> >();
template void passOne<Floats2<double> >();
template void passOne<Floats3<float> >();
template void passOne<Floats3<double> >();
template void passOne<Floats4<float> >();
template void passOne<Floats4<double> >();
template void passOne<Matrix4x4<float> >();
template void passOne<Matrix4x4<double> >();
template void passOne<QChar>();
template void passOne<QLatin1String>();
template void passOne<QMatrix<float> >();
template void passOne<QMatrix<double> >();
template void passOne<QMatrix4x4<float> >();
template void passOne<QMatrix4x4<double> >();
template void returnOne<Pointers2>();
template void returnOne<Pointers4>();
template void returnOne<Integers2>();
template void returnOne<Integers4>();
template void returnOne<Floats2<float> >();
template void returnOne<Floats2<double> >();
template void returnOne<Floats3<float> >();
template void returnOne<Floats3<double> >();
template void returnOne<Floats4<float> >();
template void returnOne<Floats4<double> >();
template void returnOne<Matrix4x4<float> >();
template void returnOne<Matrix4x4<double> >();
template void returnOne<QChar>();
template void returnOne<QLatin1String>();
template void returnOne<QMatrix<float> >();
template void returnOne<QMatrix<double> >();
template void returnOne<QMatrix4x4<float> >();
template void returnOne<QMatrix4x4<double> >();
In addition, we are interested in what happens to non-structure floating-point parameters: are they promoted or not? So we will also test the following:
void passFloat()
{
void externalFloat(float, float, float, float);
externalFloat(1.0f, 2.0f, 3.0f, 4.0f);
}
void passDouble()
{
void externalDouble(double, double, double, double);
externalDouble(1.0f, 2.0f, 3.0f, 4.0f);
}
float returnFloat()
{
return 1.0f;
}
double returnDouble()
{
return 1.0;
}
Analysis of the output
x86-64
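You may have noticed that I skipped old-style 32-bit x86. That was intentional, since that platform does not support passing by registers anyway. The only conclusions we could draw from it would be: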
whether the structures are stored in the stack in the place of the argument, or whether they’re stored elsewhere and it’s passed by pointer
whether single-precision floating-point is promoted to double-precision
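Besides, I intentionally ignored it because I want people to start thinking about the new ILP32 ABI for x86-64, enabled by GCC 4.7's -mx32 switch, which follows the same ABI as described below (except that pointers are 32-bit).
So let's take a look at the assembly results. For parameter passing, we found that: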
Pointers2 is passed in registers;
Pointers4 is passed in memory;
Integers2 is passed in a single register (two 32-bit values per 64-bit register);
Integers4 is passed in two registers only (two 32-bit values per 64-bit register);
Floats2<float> is passed packed into a single SSE register, no promotion to double
Floats3<float> is passed packed into two SSE registers, no promotion to double;
Floats4<float> is passed packed into two SSE registers, no promotion to double;
Floats2<double> is passed in two SSE registers, one value per register
Floats3<double> and Floats4<double> are passed in memory;
Matrix4x4 and QMatrix4x4 are passed in memory regardless of the underlying type;
QChar is passed in a register;
QLatin1String is passed in registers.
The floating point parameters are passed one per register, without float promotion to double.
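For return values, the conclusion is the same as above: if the value is passed in registers, it is returned in registers too; if it is passed in memory, it is returned in memory. With a careful reading of the ABI document, we reach the following conclusions: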
Single-precision floating-point types are not promoted to double;
Single-precision floating-point types in a structure are packed into SSE registers if they are still available
Structures bigger than 16 bytes are passed in memory, with an exception for __m256, the type corresponding to one AVX 256-bit register.
IA-64
The results for parameter passing are:
Both Pointers structures are passed in registers, one pointer per register;
Both Integers structures are passed in registers, packed like x86-64 (two ints per register);
All of the Floats structures are passed in registers, one value per register (unpacked);
QMatrix4x4<float> is passed entirely in registers: half of it (the first 8 floats) are in floating-point registers, one value per register (unpacked); the other half is passed in integer registers out4 to out7 as the memory representations (packed);
QMatrix4x4<double> is passed partly in registers: half of it (the first 8 doubles) are in floating-point registers, one value per register (unpacked); the other half is passed in memory;
QChar and QLatin1String are passed in registers;
Both QMatrix are passed entirely in registers, one value per register (unpacked);
QMatrix4x4 is passed like Matrix4x4, except that the integer is always in memory (the structure is larger than 8*8 bytes);
Individual floating-point parameters are passed one per register; type promotion happens internally in the register.
For return values, we have:
The floating-point structures with up to 8 floating-point members are returned in registers;
The integer structures of up to 32 bytes are returned in registers;
All the rest is returned in memory supplied by the caller.
The conclusions are:
Type promotion happens in hardware, as IA-64 does not have specific registers for single or double precision (its FP registers hold only extended precision data);
Homogeneous structures of floating-point types are passed in registers, up to 8 values; the rest goes to the integer registers if there are some still available or in memory;
All other structures are passed in the integer registers, up to 64 bytes;
Integer registers are allocated for passing any and all types, even if they aren't used (the ABI says they should be used in the case of C without prototypes).
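ARM
I have only compiled the code for ARMv7, with floating-point parameters passed in VFP registers. If you are reading this blog, you are probably interested in performance, so you must be using the "hard-float" model of ARM. I will not concern myself with the slower "soft-float" mode. Also note that this is ARMv7 only: the ARMv8 64-bit (AArch64) rules are slightly different, but no compiler is available for it.
The results for parameter passing are: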
Pointers2, Pointers4, Integers2, and Integers4 are passed in registers (note that the Pointers and Integers structures are the same in 32-bit mode);
All of the Float types are passed in registers, one value per register, without promotion of floats to doubles; the values are also stored in memory but I can't tell if this is required or just GCC being dumb;
All types of Matrix4x4, QMatrix and QMatrix4x4 are passed in both memory and registers, which contains the first 16 bytes;
QChar and QLatin1String are passed in registers;
are passed in memory regardless of the underlying type.
The floating point parameters are passed one per register, without float promotion to double.
For returning those types, we have:
All of the Float types are returned in registers and GCC then stores them all to memory even if they are never used afterwards;
QChar is returned in a register;
Everything else is returned in memory.
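Note that returning is one of the places where the 32-bit AAPCS differs from the 64-bit one: there, if a type is passed in registers to a function as the first parameter, it is returned in those same registers. The 32-bit AAPCS restricts return-in-registers to structures of 4 bytes or less.
My conclusions are: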
Single-precision floating-point types are not promoted to double;
Homogeneous structures (that is, structures containing one single type) of a floating-point type are passed in floating-point registers if the structure has 4 members or fewer;
MIPS
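I tried both a MIPS 32-bit build (using the GCC-default o32 ABI) and a MIPS 64-bit build (using -mabi=o64 -mlong64). Unless noted otherwise, the results are the same on both architectures.
For passing parameters, they are: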
Both types of Integers and Pointers structures are passed in registers; on 64-bit, two 32-bit integers are packed into a single 64-bit register like x86-64;
Floats2<float>, Floats3<float>, and Floats4<float> are passed in integer registers, not in the floating-point registers; on 64-bit, two floats are packed into a single 64-bit register;
Floats2<double> is passed in integer registers; on 32-bit, two 32-bit registers are required to store each double;
On 32-bit, the first two doubles of Floats3<double> and Floats4<double> are passed in integer registers, the rest are passed in memory;
On 64-bit, Floats3<double> and Floats4<double> are passed entirely in integer registers;
Matrix4x4, QMatrix, and QMatrix4x4 are passed in integer registers (the portion that fits) and in memory (the rest);
QChar is passed in a register (on MIPS big-endian, it's passed on bits 16-31);
QLatin1String is passed on two registers;
The floating point parameters are passed one per register, without float promotion to double.
For return values, MIPS is easy: everything is returned in memory, even QChar.
The conclusions are even easier:
No float is promoted to double;
No structure is ever passed in floating-point registers;
No structure is ever returned in registers.
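General conclusion
There are only a few overall conclusions we can draw. One of them is that single-precision floating-point values are not explicitly promoted to double precision when a formal parameter is present. The automatic promotion probably happens only for floating-point values passed through an ellipsis (...), but our problem statement was about calling functions whose parameters are known. The only one that deviates slightly from the rule is IA-64, but that is unimportant since the hardware, like x87, operates in only one mode.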
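At the language level, the distinction between a prototyped float parameter and one passed through an ellipsis can be sketched as follows. The function names are hypothetical; the point is only that a float passed through ... undergoes the default argument promotion and must be read back as double.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>

// Prototyped parameter: the argument arrives as a float.
std::size_t prototyped_size(float f) { return sizeof f; }

// Variadic parameter: a float argument undergoes default argument promotion
// to double, so it has to be read back with va_arg(ap, double).
double first_variadic(int count, ...) {
    std::va_list ap;
    va_start(ap, count);
    double d = va_arg(ap, double);
    va_end(ap);
    return d;
}

void promotion_demo() {
    assert(prototyped_size(1.5f) == sizeof(float));
    assert(first_variadic(1, 1.5f) == 1.5);  // 1.5 is exactly representable
}
```

This says nothing about which registers a given ABI uses; it only shows where the language itself requires the promotion.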
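For structures containing integer members (including pointers), there is nothing further to optimise: they are loaded into registers exactly as they appear in memory. This means that the portion of a register corresponding to padding may contain uninitialised or garbage data, or it may produce something really strange, as on MIPS in big-endian mode. It also means that, on all architectures, types smaller than a register do not occupy an entire register, so they may be packed together with other members.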
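The packing and padding mentioned here can be inspected with sizeof and offsetof. The sketch below reuses two structures from the post; the helper latin1_tail_padding is a hypothetical name, and the concrete numbers in the comments assume a 4-byte int and either 8-byte or 4-byte pointers.

```cpp
#include <cassert>
#include <cstddef>

struct Integers2 { int i1, i2; };                     // from the post
struct QLatin1String { const char *str; int len; };   // from the post

// Hypothetical helper: bytes after the last member that belong to the
// structure but hold no data (tail padding).
constexpr std::size_t latin1_tail_padding() {
    return sizeof(QLatin1String) - (offsetof(QLatin1String, len) + sizeof(int));
}

void padding_demo() {
    // Two ints sit back to back, so on a 64-bit target both fit one register.
    assert(sizeof(Integers2) == 2 * sizeof(int));
    // With 8-byte pointers and 4-byte int: 16 - (8 + 4) = 4 padding bytes.
    // With 4-byte pointers there is no tail padding.
    assert(latin1_tail_padding() == (sizeof(void*) == 8 ? 4u : 0u));
}
```

Those padding bytes are the part of the register that may carry garbage when the structure is passed by value.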
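Another is quite evident: structures containing floats are smaller than structures containing doubles, so they will use less memory or fewer registers to be passed.
To continue drawing conclusions, we need to exclude MIPS, since it passes everything in integer registers and returns everything via memory. If we do that, we can see that all the ABIs provide an optimisation for structures containing only one floating-point type. In the ABI documents they go by slightly different names, all meaning a homogeneous floating-point structure. Those optimisations mean that the structure is passed in floating-point registers under certain conditions.
The first one to break down is actually x86-64: the upper limit is 16 bytes, limited to two SSE registers. The rationale for this seems to be passing one double-precision complex value, which takes 16 bytes. That we are able to pass four single-precision values is an unexpected benefit.
The remaining architectures (ARM and IA-64) can pass more values in registers, and always one value per register (no packing). IA-64 has more registers dedicated to parameter passing, so it can pass more than ARM.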
Recommendations for code
Structures of up to 16 bytes containing integers and pointers should be passed by value;
Homogeneous structures of up to 16 bytes containing floating-point should be passed by value (2 doubles or 4 floats);
Mixed-type structures should be avoided; if they exist, passing by value is still a good idea;
The above is only valid for structures that are trivially copyable and trivially destructible. All C structures (POD in C++) meet these criteria.
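These conditions can be checked at compile time. The following is a rough sketch, not part of the original post: pass_by_value_candidate is a hypothetical helper, the 16-byte limit is taken from the x86-64 findings and is not universal across ABIs, and the helper does not check whether a floating-point structure is homogeneous.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

struct Integers4 { int i1, i2, i3, i4; };                     // from the post
template <typename F> struct Floats4 { F f1, f2, f3, f4; };   // from the post

// Hypothetical helper: size limit plus the two triviality requirements.
template <typename T>
constexpr bool pass_by_value_candidate() {
    return sizeof(T) <= 16
        && std::is_trivially_copyable<T>::value
        && std::is_trivially_destructible<T>::value;
}

void candidate_demo() {
    assert(pass_by_value_candidate<Integers4>());          // 16 bytes with 4-byte int
    assert(pass_by_value_candidate<Floats4<float> >());    // 16 bytes
    assert(!pass_by_value_candidate<Floats4<double> >());  // 32 bytes
    assert(!pass_by_value_candidate<std::string>());       // non-trivial copy
}
```

A helper like this could back a static_assert next to a function signature, so that a structure that later grows or gains a destructor is flagged at compile time.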
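Final note
I should note that the recommendations above do not always produce more efficient code. Even though the values can be passed in registers, every single compiler I tested (GCC 4.6, Clang 3.0, ICC 12.1) still did a lot of memory operations in some circumstances. It is quite common for the compiler to write the structure to memory and then load it into registers. When it does that, passing by constant reference would be more efficient, since it would replace the memory loads with arithmetic on the stack pointer.
However, those are simply a matter of further optimisation work by the compiler teams. The three compilers I tested for x86-64 optimised differently, and in almost all cases at least one of them managed to do without memory accesses. Interestingly, the behaviour also changes when we replace the padding space with zeroes.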