访问 5 字节结构比 8 字节慢得多
Accessing 5-byte struct is much slower than 8-byte
我有一个代码可以根据另一个小数组中的值更新一个数组。
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
其中 c
是一个结构,它是表示一副牌中卡片的一对字节。
one
是一个大小为 52 的数组(每副牌有 52 张牌的条目)
我写了一个基准来分析这段代码:
private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
设置 testRepetitions
= 2500 万,并使用 256 个元素的数组 (result.Length = 256
),它在我的机器上运行大约 8.5 秒。
这是 Cards
结构:
struct Cards
{
public byte C0;
public byte C1;
public Cards(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
}
}
当我修改该结构以容纳 5 张卡片(5 字节)时,相同的基准测试现在需要大约 13 秒。
为什么会这样?计算相同,剩余3张卡未使用,所有数组都足够小,可以放入L1缓存。
更重要的是 运行ger,如果我进一步将 Cards 更改为现在包含 8 个字节,基准测试现在会更快,大约需要 10 秒。
我的设置:
VS 2015 Update 3.
.NET 4.6.2
Release Build x64
CPU: Haswell i7-5820K CPU @ 3.30GHz
这是我得到的确切时间:
Test With 2 Cards. Time = 8582 ms
Test With 5 Cards. Time = 12910 ms
Test With 8 Cards. Time = 10180 ms
这是怎么回事?
基准代码:
class TestAdjustment
{
public void Test()
{
using (Process p = Process.GetCurrentProcess())
p.PriorityClass = ProcessPriorityClass.High;
var size = 256;
float[] one = ArrayUtils.CreateRandomFloatArray(size:52);
int[] card0 = ArrayUtils.RandomIntArray(size, minValue:0, maxValueInclusive:51);
int[] card1 = ArrayUtils.RandomIntArray(size, minValue: 0, maxValueInclusive: 51);
Cards[] cards = CreateCardsArray(card0, card1);
Cards5[] cards5 = CreateCards5Array(card0, card1);
Cards8[] cards8 = CreateCards8Array(card0, card1);
float[] result = ArrayUtils.CreateRandomFloatArray(size);
float[] resultClone = result.ToArray();
var testRepetitions = 25*1000*1000;
var sw = Stopwatch.StartNew();
TestCards2(testRepetitions, result, one, cards);
WriteLine($"Test With 2 Cards. Time = {sw.ElapsedMilliseconds} ms");
result = resultClone.ToArray(); //restore original array from the clone, so that next method works on the same data
sw.Restart();
TestCards5(testRepetitions, result, one, cards5);
WriteLine($"Test With 5 Cards. Time = {sw.ElapsedMilliseconds} ms");
result = resultClone.ToArray();
sw.Restart();
TestCards8(testRepetitions, result, one, cards8);
WriteLine($"Test With 8 Cards. Time = {sw.ElapsedMilliseconds} ms");
}
private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
private void TestCards5(int testRepetitions, float[] result, float[] one, Cards5[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
private void TestCards8(int testRepetitions, float[] result, float[] one, Cards8[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
private Cards[] CreateCardsArray(int[] c0, int[] c1)
{
var result = new Cards[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards((byte)c0[i], (byte)c1[i]);
return result;
}
private Cards5[] CreateCards5Array(int[] c0, int[] c1)
{
var result = new Cards5[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards5((byte)c0[i], (byte)c1[i]);
return result;
}
private Cards8[] CreateCards8Array(int[] c0, int[] c1)
{
var result = new Cards8[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards8((byte)c0[i], (byte)c1[i]);
return result;
}
}
struct Cards
{
public byte C0;
public byte C1;
public Cards(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
}
}
struct Cards5
{
public byte C0, C1, C2, C3, C4;
public Cards5(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
C2 = C3 = C4 = 0;
}
}
struct Cards8
{
public byte C0, C1, C2, C3, C4, C5, C6, C7;
public Cards8(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
C2 = C3 = C4 = C5 = C6 = C7 = 0;
}
}
编辑
我再次重新运行了基准测试,迭代了 1 亿次。以下是结果:
Test With 5 Cards. Time = 52245 ms
Test With 8 Cards. Time = 40531 ms
并以相反的顺序:
Test With 8 Cards. Time = 41041 ms
Test With 5 Cards. Time = 52034 ms
运行 它在 Surface Pro 4 上(Skylake i7-6650U Turbo 加速到 ~3.4ghz):
Test With 8 Cards. Time = 47913 ms
Test With 5 Cards. Time = 55182 ms
因此差异仍然存在并且不取决于顺序。
我还 运行 使用 Intel VTune 进行分析,它显示“5 卡”版本的 CPI 为 0.3
,“8 卡”版本的 CPI 为 0.27
。
Edit2 添加了 ArrayUtils class 用于创建初始 运行dom 数组。
public static class ArrayUtils
{
static Random rand = new Random(137);
public static float[] CreateRandomFloatArray(int size)
{
var result = new float[size];
for (int i = 0; i < size; i++)
result[i] = (float) rand.NextDouble();
return result;
}
public static int[] RandomIntArray(int size, int minValue, int maxValueInclusive)
{
var result = new int[size];
for (int i = 0; i < size; i++)
result[i] = rand.Next(minValue, maxValueInclusive + 1);
return result;
}
}
我的假设是这与内存对齐有关。
技术信息:
一些体系结构,例如 MIPS 体系结构,实际上不能一次只修改内存中的一个字节。他们必须将一个数据字加载到寄存器中,屏蔽掉不相关的位,然后执行计算。
您实际上可能会通过使用普通 int 而不是字节来加快速度,因为它完全避免了这个问题。
都是关于未对齐的内存访问。未对齐内存就绪延迟大于对齐内存读取延迟。为了完成实验,让我们添加结构 Cards3
、Cards4
等等。然后让我们看看对应的数组在内存中是如何表示的。
接下来,让我们改进您的基准测试。
- 我们将使用BenchmarkDotNet(这个工具会做很多基准测试例程,如预热、自动选择迭代量、统计计算等)。
- 我们将对所有
Cards2
..Cards8
数组进行基准测试,而不仅仅是其中的 3 个。
- 我们还将检查完整 .NET Framework(LegacyJIT-x86、LegacyJIT-x64、RyuJIT-x64)和 Mono 的所有 JIT 编译器。
这是我的环境:
Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8
Frequency=2728068 ticks, Resolution=366.5598 ns, Timer=TSC
CLR1=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
CLR2=Mono JIT compiler version 4.4.0, Arch=32-bit
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0
我的结果:
Method | Platform | Jit | Toolchain | Runtime | Median | StdDev |
------- |--------- |---------- |---------- |-------- |---------- |---------- |
C2 | Host | Host | Mono | Mono | 3.9230 ns | 0.0532 ns |
C3 | Host | Host | Mono | Mono | 4.8223 ns | 0.0920 ns |
C4 | Host | Host | Mono | Mono | 5.9149 ns | 0.1207 ns |
C5 | Host | Host | Mono | Mono | 6.3981 ns | 0.0913 ns |
C6 | Host | Host | Mono | Mono | 7.1179 ns | 0.1222 ns |
C7 | Host | Host | Mono | Mono | 7.6318 ns | 0.1269 ns |
C8 | Host | Host | Mono | Mono | 8.4650 ns | 0.1497 ns |
C2 | X64 | LegacyJit | Host | Host | 2.3515 ns | 0.0150 ns |
C3 | X64 | LegacyJit | Host | Host | 4.2553 ns | 0.0700 ns |
C4 | X64 | LegacyJit | Host | Host | 1.4366 ns | 0.0385 ns |
C5 | X64 | LegacyJit | Host | Host | 2.3688 ns | 0.0359 ns |
C6 | X64 | LegacyJit | Host | Host | 2.3684 ns | 0.0404 ns |
C7 | X64 | LegacyJit | Host | Host | 3.0404 ns | 0.0664 ns |
C8 | X64 | LegacyJit | Host | Host | 1.4510 ns | 0.0333 ns |
C2 | X64 | RyuJit | Host | Host | 1.9281 ns | 0.0306 ns |
C3 | X64 | RyuJit | Host | Host | 2.1183 ns | 0.0348 ns |
C4 | X64 | RyuJit | Host | Host | 1.9395 ns | 0.0397 ns |
C5 | X64 | RyuJit | Host | Host | 2.7706 ns | 0.0387 ns |
C6 | X64 | RyuJit | Host | Host | 2.6471 ns | 0.0513 ns |
C7 | X64 | RyuJit | Host | Host | 2.9743 ns | 0.0541 ns |
C8 | X64 | RyuJit | Host | Host | 2.6280 ns | 0.1526 ns |
C2 | X86 | LegacyJit | Host | Host | 3.0854 ns | 0.2172 ns |
C3 | X86 | LegacyJit | Host | Host | 3.1627 ns | 0.1126 ns |
C4 | X86 | LegacyJit | Host | Host | 3.0577 ns | 0.0929 ns |
C5 | X86 | LegacyJit | Host | Host | 5.0957 ns | 0.1601 ns |
C6 | X86 | LegacyJit | Host | Host | 6.1723 ns | 0.1177 ns |
C7 | X86 | LegacyJit | Host | Host | 7.1155 ns | 0.0803 ns |
C8 | X86 | LegacyJit | Host | Host | 3.7703 ns | 0.1276 ns |
完整源代码:
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Exporters;
using BenchmarkDotNet.Attributes.Jobs;
using BenchmarkDotNet.Running;
[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob]
[RPlotExporter]
public class CardBenchmarks
{
public const int Size = 256;
private float[] result, one;
private Cards2[] cards2;
private Cards3[] cards3;
private Cards4[] cards4;
private Cards5[] cards5;
private Cards6[] cards6;
private Cards7[] cards7;
private Cards8[] cards8;
[Setup]
public void Setup()
{
result = ArrayUtils.CreateRandomFloatArray(Size);
one = ArrayUtils.CreateRandomFloatArray(size: 52);
var c0 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);
var c1 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);
cards2 = CardUtls.Create2(c0, c1);
cards3 = CardUtls.Create3(c0, c1);
cards4 = CardUtls.Create4(c0, c1);
cards5 = CardUtls.Create5(c0, c1);
cards6 = CardUtls.Create6(c0, c1);
cards7 = CardUtls.Create7(c0, c1);
cards8 = CardUtls.Create8(c0, c1);
}
[Benchmark(OperationsPerInvoke = Size)]
public void C2()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards2[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C3()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards3[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C4()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards4[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C5()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards5[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C6()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards6[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C7()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards7[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C8()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards8[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
}
public static class ArrayUtils
{
private static readonly Random Rand = new Random(137);
public static float[] CreateRandomFloatArray(int size)
{
var result = new float[size];
for (int i = 0; i < size; i++)
result[i] = (float) Rand.NextDouble();
return result;
}
public static byte[] RandomByteArray(int size, int minValue, int maxValueInclusive)
{
var result = new byte[size];
for (int i = 0; i < size; i++)
result[i] = (byte) Rand.Next(minValue, maxValueInclusive + 1);
return result;
}
}
public static class CardUtls
{
private static T[] Create<T>(int length, Func<int, T> create) => Enumerable.Range(0, length).Select(create).ToArray();
public static Cards2[] Create2(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards2 {C0 = c0[i], C1 = c1[i]});
public static Cards3[] Create3(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards3 {C0 = c0[i], C1 = c1[i]});
public static Cards4[] Create4(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards4 {C0 = c0[i], C1 = c1[i]});
public static Cards5[] Create5(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards5 {C0 = c0[i], C1 = c1[i]});
public static Cards6[] Create6(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards6 {C0 = c0[i], C1 = c1[i]});
public static Cards7[] Create7(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards7 {C0 = c0[i], C1 = c1[i]});
public static Cards8[] Create8(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards8 {C0 = c0[i], C1 = c1[i]});
}
public struct Cards2
{
public byte C0, C1;
}
public struct Cards3
{
public byte C0, C1, C2;
}
public struct Cards4
{
public byte C0, C1, C2, C3;
}
public struct Cards5
{
public byte C0, C1, C2, C3, C4;
}
public struct Cards6
{
public byte C0, C1, C2, C3, C4, C5;
}
public struct Cards7
{
public byte C0, C1, C2, C3, C4, C5, C6;
}
public struct Cards8
{
public byte C0, C1, C2, C3, C4, C5, C6, C7;
}
class Program
{
static void Main()
{
BenchmarkRunner.Run<CardBenchmarks>();
}
}
回答
如您所见,您的基准测试很困难,有很多不同的因素会影响您的性能。最重要的事情之一是运行时如何布局数据。例如,您不会在 Mono 上观察到所描述的行为,因为 Mono 和 Full Framework 具有不同的布局算法(在 Mono 上我们有 Marshal.SizeOf(typeof(Card2)) == 8
)。
我们在完整框架上有 Time(Card5) > Time(Card8)
,因为 Card5
产生了很多未对齐的读取,这与 Card8
不同(见第一张图片)。
更新
来自的问题:
Any idea why 3-byte performs better than 8-byte on your RyuJIT64 benchmark?
我们来看asm代码:
; RyuJIT-x64, C3
var c = cards3[i];
00007FFEDF0CADCE mov r10,r9
00007FFEDF0CADD1 mov r11d,dword ptr [r10+8]
00007FFEDF0CADD5 cmp eax,r11d
00007FFEDF0CADD8 jae 00007FFEDF0CAE44
00007FFEDF0CADDA movsxd r11,eax
00007FFEDF0CADDD imul r11,r11,3
00007FFEDF0CADE1 lea r10,[r10+r11+10h]
00007FFEDF0CADE6 movzx r11d,byte ptr [r10] ; !!!
00007FFEDF0CADEA movzx r10d,byte ptr [r10+1] ; !!!
; RyuJIT-x64, C8
var c = cards8[i];
00007FFEDF0CAE8C mov rdx,qword ptr [rcx+48h]
00007FFEDF0CAE90 mov r8d,dword ptr [rdx+8]
00007FFEDF0CAE94 cmp eax,r8d
00007FFEDF0CAE97 jae 00007FFEDF0CAF0C
00007FFEDF0CAE99 movsxd r8,eax
00007FFEDF0CAE9C mov rdx,qword ptr [rdx+r8*8+10h] ; !!!
00007FFEDF0CAEA1 mov qword ptr [rsp+28h],rdx ; !!!
在C3
情况下,RyuJIT将目标字节保存在r11d
、r10d
寄存器中;在 C8
的情况下,RyuJIT 将它们保存在堆栈中 (qword ptr [rsp+28h]
)。解释:当前版本的 RyuJIT(在我的例子中是 v4.6.1080.0)对不超过 4 个字段的结构进行标量替换(参见 coreclr#6839)。因此,C5
、C6
、C7
和 C8
的 RyuJIT 性能比 C2
、C3
、[=33= 的性能差].注意:此行为可能会在 RyuJIT 的未来版本中更改。
我有一个代码可以根据另一个小数组中的值更新一个数组。
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
其中 c
是一个结构,它是表示一副牌中卡片的一对字节。
one
是一个大小为 52 的数组(每副牌有 52 张牌的条目)
我写了一个基准来分析这段代码:
private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
设置 testRepetitions
= 2500 万,并使用 256 个元素的数组 (result.Length = 256
),它在我的机器上运行大约 8.5 秒。
这是 Cards
结构:
struct Cards
{
public byte C0;
public byte C1;
public Cards(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
}
}
当我修改该结构以容纳 5 张卡片(5 字节)时,相同的基准测试现在需要大约 13 秒。 为什么会这样?计算相同,剩余3张卡未使用,所有数组都足够小,可以放入L1缓存。
更重要的是 运行ger,如果我进一步将 Cards 更改为现在包含 8 个字节,基准测试现在会更快,大约需要 10 秒。
我的设置:
VS 2015 Update 3.
.NET 4.6.2
Release Build x64
CPU: Haswell i7-5820K CPU @ 3.30GHz
这是我得到的确切时间:
Test With 2 Cards. Time = 8582 ms
Test With 5 Cards. Time = 12910 ms
Test With 8 Cards. Time = 10180 ms
这是怎么回事?
基准代码:
class TestAdjustment
{
public void Test()
{
using (Process p = Process.GetCurrentProcess())
p.PriorityClass = ProcessPriorityClass.High;
var size = 256;
float[] one = ArrayUtils.CreateRandomFloatArray(size:52);
int[] card0 = ArrayUtils.RandomIntArray(size, minValue:0, maxValueInclusive:51);
int[] card1 = ArrayUtils.RandomIntArray(size, minValue: 0, maxValueInclusive: 51);
Cards[] cards = CreateCardsArray(card0, card1);
Cards5[] cards5 = CreateCards5Array(card0, card1);
Cards8[] cards8 = CreateCards8Array(card0, card1);
float[] result = ArrayUtils.CreateRandomFloatArray(size);
float[] resultClone = result.ToArray();
var testRepetitions = 25*1000*1000;
var sw = Stopwatch.StartNew();
TestCards2(testRepetitions, result, one, cards);
WriteLine($"Test With 2 Cards. Time = {sw.ElapsedMilliseconds} ms");
result = resultClone.ToArray(); //restore original array from the clone, so that next method works on the same data
sw.Restart();
TestCards5(testRepetitions, result, one, cards5);
WriteLine($"Test With 5 Cards. Time = {sw.ElapsedMilliseconds} ms");
result = resultClone.ToArray();
sw.Restart();
TestCards8(testRepetitions, result, one, cards8);
WriteLine($"Test With 8 Cards. Time = {sw.ElapsedMilliseconds} ms");
}
private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
private void TestCards5(int testRepetitions, float[] result, float[] one, Cards5[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
private void TestCards8(int testRepetitions, float[] result, float[] one, Cards8[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
private Cards[] CreateCardsArray(int[] c0, int[] c1)
{
var result = new Cards[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards((byte)c0[i], (byte)c1[i]);
return result;
}
private Cards5[] CreateCards5Array(int[] c0, int[] c1)
{
var result = new Cards5[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards5((byte)c0[i], (byte)c1[i]);
return result;
}
private Cards8[] CreateCards8Array(int[] c0, int[] c1)
{
var result = new Cards8[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards8((byte)c0[i], (byte)c1[i]);
return result;
}
}
struct Cards
{
public byte C0;
public byte C1;
public Cards(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
}
}
struct Cards5
{
public byte C0, C1, C2, C3, C4;
public Cards5(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
C2 = C3 = C4 = 0;
}
}
struct Cards8
{
public byte C0, C1, C2, C3, C4, C5, C6, C7;
public Cards8(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
C2 = C3 = C4 = C5 = C6 = C7 = 0;
}
}
编辑 我再次重新运行了基准测试,迭代了 1 亿次。以下是结果:
Test With 5 Cards. Time = 52245 ms
Test With 8 Cards. Time = 40531 ms
并以相反的顺序:
Test With 8 Cards. Time = 41041 ms
Test With 5 Cards. Time = 52034 ms
运行 它在 Surface Pro 4 上(Skylake i7-6650U Turbo 加速到 ~3.4ghz):
Test With 8 Cards. Time = 47913 ms
Test With 5 Cards. Time = 55182 ms
因此差异仍然存在并且不取决于顺序。
我还 运行 使用 Intel VTune 进行分析,它显示“5 卡”版本的 CPI 为 0.3
,“8 卡”版本的 CPI 为 0.27
。
Edit2 添加了 ArrayUtils class 用于创建初始 运行dom 数组。
public static class ArrayUtils
{
static Random rand = new Random(137);
public static float[] CreateRandomFloatArray(int size)
{
var result = new float[size];
for (int i = 0; i < size; i++)
result[i] = (float) rand.NextDouble();
return result;
}
public static int[] RandomIntArray(int size, int minValue, int maxValueInclusive)
{
var result = new int[size];
for (int i = 0; i < size; i++)
result[i] = rand.Next(minValue, maxValueInclusive + 1);
return result;
}
}
我的假设是这与内存对齐有关。
技术信息:
一些体系结构,例如 MIPS 体系结构,实际上不能一次只修改内存中的一个字节。他们必须将一个数据字加载到寄存器中,屏蔽掉不相关的位,然后执行计算。
您实际上可能会通过使用普通 int 而不是字节来加快速度,因为它完全避免了这个问题。
都是关于未对齐的内存访问。未对齐内存就绪延迟大于对齐内存读取延迟。为了完成实验,让我们添加结构 Cards3
、Cards4
等等。然后让我们看看对应的数组在内存中是如何表示的。
接下来,让我们改进您的基准测试。
- 我们将使用BenchmarkDotNet(这个工具会做很多基准测试例程,如预热、自动选择迭代量、统计计算等)。
- 我们将对所有
Cards2
..Cards8
数组进行基准测试,而不仅仅是其中的 3 个。 - 我们还将检查完整 .NET Framework(LegacyJIT-x86、LegacyJIT-x64、RyuJIT-x64)和 Mono 的所有 JIT 编译器。
这是我的环境:
Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8
Frequency=2728068 ticks, Resolution=366.5598 ns, Timer=TSC
CLR1=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
CLR2=Mono JIT compiler version 4.4.0, Arch=32-bit
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0
我的结果:
Method | Platform | Jit | Toolchain | Runtime | Median | StdDev |
------- |--------- |---------- |---------- |-------- |---------- |---------- |
C2 | Host | Host | Mono | Mono | 3.9230 ns | 0.0532 ns |
C3 | Host | Host | Mono | Mono | 4.8223 ns | 0.0920 ns |
C4 | Host | Host | Mono | Mono | 5.9149 ns | 0.1207 ns |
C5 | Host | Host | Mono | Mono | 6.3981 ns | 0.0913 ns |
C6 | Host | Host | Mono | Mono | 7.1179 ns | 0.1222 ns |
C7 | Host | Host | Mono | Mono | 7.6318 ns | 0.1269 ns |
C8 | Host | Host | Mono | Mono | 8.4650 ns | 0.1497 ns |
C2 | X64 | LegacyJit | Host | Host | 2.3515 ns | 0.0150 ns |
C3 | X64 | LegacyJit | Host | Host | 4.2553 ns | 0.0700 ns |
C4 | X64 | LegacyJit | Host | Host | 1.4366 ns | 0.0385 ns |
C5 | X64 | LegacyJit | Host | Host | 2.3688 ns | 0.0359 ns |
C6 | X64 | LegacyJit | Host | Host | 2.3684 ns | 0.0404 ns |
C7 | X64 | LegacyJit | Host | Host | 3.0404 ns | 0.0664 ns |
C8 | X64 | LegacyJit | Host | Host | 1.4510 ns | 0.0333 ns |
C2 | X64 | RyuJit | Host | Host | 1.9281 ns | 0.0306 ns |
C3 | X64 | RyuJit | Host | Host | 2.1183 ns | 0.0348 ns |
C4 | X64 | RyuJit | Host | Host | 1.9395 ns | 0.0397 ns |
C5 | X64 | RyuJit | Host | Host | 2.7706 ns | 0.0387 ns |
C6 | X64 | RyuJit | Host | Host | 2.6471 ns | 0.0513 ns |
C7 | X64 | RyuJit | Host | Host | 2.9743 ns | 0.0541 ns |
C8 | X64 | RyuJit | Host | Host | 2.6280 ns | 0.1526 ns |
C2 | X86 | LegacyJit | Host | Host | 3.0854 ns | 0.2172 ns |
C3 | X86 | LegacyJit | Host | Host | 3.1627 ns | 0.1126 ns |
C4 | X86 | LegacyJit | Host | Host | 3.0577 ns | 0.0929 ns |
C5 | X86 | LegacyJit | Host | Host | 5.0957 ns | 0.1601 ns |
C6 | X86 | LegacyJit | Host | Host | 6.1723 ns | 0.1177 ns |
C7 | X86 | LegacyJit | Host | Host | 7.1155 ns | 0.0803 ns |
C8 | X86 | LegacyJit | Host | Host | 3.7703 ns | 0.1276 ns |
完整源代码:
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Exporters;
using BenchmarkDotNet.Attributes.Jobs;
using BenchmarkDotNet.Running;
[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob]
[RPlotExporter]
public class CardBenchmarks
{
public const int Size = 256;
private float[] result, one;
private Cards2[] cards2;
private Cards3[] cards3;
private Cards4[] cards4;
private Cards5[] cards5;
private Cards6[] cards6;
private Cards7[] cards7;
private Cards8[] cards8;
[Setup]
public void Setup()
{
result = ArrayUtils.CreateRandomFloatArray(Size);
one = ArrayUtils.CreateRandomFloatArray(size: 52);
var c0 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);
var c1 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);
cards2 = CardUtls.Create2(c0, c1);
cards3 = CardUtls.Create3(c0, c1);
cards4 = CardUtls.Create4(c0, c1);
cards5 = CardUtls.Create5(c0, c1);
cards6 = CardUtls.Create6(c0, c1);
cards7 = CardUtls.Create7(c0, c1);
cards8 = CardUtls.Create8(c0, c1);
}
[Benchmark(OperationsPerInvoke = Size)]
public void C2()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards2[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C3()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards3[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C4()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards4[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C5()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards5[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C6()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards6[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C7()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards7[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
[Benchmark(OperationsPerInvoke = Size)]
public void C8()
{
for (var i = 0; i < result.Length; i++)
{
var c = cards8[i];
result[i] -= one[c.C0] + one[c.C1];
}
}
}
public static class ArrayUtils
{
private static readonly Random Rand = new Random(137);
public static float[] CreateRandomFloatArray(int size)
{
var result = new float[size];
for (int i = 0; i < size; i++)
result[i] = (float) Rand.NextDouble();
return result;
}
public static byte[] RandomByteArray(int size, int minValue, int maxValueInclusive)
{
var result = new byte[size];
for (int i = 0; i < size; i++)
result[i] = (byte) Rand.Next(minValue, maxValueInclusive + 1);
return result;
}
}
public static class CardUtls
{
private static T[] Create<T>(int length, Func<int, T> create) => Enumerable.Range(0, length).Select(create).ToArray();
public static Cards2[] Create2(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards2 {C0 = c0[i], C1 = c1[i]});
public static Cards3[] Create3(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards3 {C0 = c0[i], C1 = c1[i]});
public static Cards4[] Create4(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards4 {C0 = c0[i], C1 = c1[i]});
public static Cards5[] Create5(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards5 {C0 = c0[i], C1 = c1[i]});
public static Cards6[] Create6(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards6 {C0 = c0[i], C1 = c1[i]});
public static Cards7[] Create7(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards7 {C0 = c0[i], C1 = c1[i]});
public static Cards8[] Create8(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards8 {C0 = c0[i], C1 = c1[i]});
}
public struct Cards2
{
public byte C0, C1;
}
public struct Cards3
{
public byte C0, C1, C2;
}
public struct Cards4
{
public byte C0, C1, C2, C3;
}
public struct Cards5
{
public byte C0, C1, C2, C3, C4;
}
public struct Cards6
{
public byte C0, C1, C2, C3, C4, C5;
}
public struct Cards7
{
public byte C0, C1, C2, C3, C4, C5, C6;
}
public struct Cards8
{
public byte C0, C1, C2, C3, C4, C5, C6, C7;
}
class Program
{
static void Main()
{
BenchmarkRunner.Run<CardBenchmarks>();
}
}
回答
如您所见,您的基准测试很困难,有很多不同的因素会影响您的性能。最重要的事情之一是运行时如何布局数据。例如,您不会在 Mono 上观察到所描述的行为,因为 Mono 和 Full Framework 具有不同的布局算法(在 Mono 上我们有 Marshal.SizeOf(typeof(Card2)) == 8
)。
我们在完整框架上有 Time(Card5) > Time(Card8)
,因为 Card5
产生了很多未对齐的读取,这与 Card8
不同(见第一张图片)。
更新
来自
Any idea why 3-byte performs better than 8-byte on your RyuJIT64 benchmark?
我们来看asm代码:
; RyuJIT-x64, C3
var c = cards3[i];
00007FFEDF0CADCE mov r10,r9
00007FFEDF0CADD1 mov r11d,dword ptr [r10+8]
00007FFEDF0CADD5 cmp eax,r11d
00007FFEDF0CADD8 jae 00007FFEDF0CAE44
00007FFEDF0CADDA movsxd r11,eax
00007FFEDF0CADDD imul r11,r11,3
00007FFEDF0CADE1 lea r10,[r10+r11+10h]
00007FFEDF0CADE6 movzx r11d,byte ptr [r10] ; !!!
00007FFEDF0CADEA movzx r10d,byte ptr [r10+1] ; !!!
; RyuJIT-x64, C8
var c = cards8[i];
00007FFEDF0CAE8C mov rdx,qword ptr [rcx+48h]
00007FFEDF0CAE90 mov r8d,dword ptr [rdx+8]
00007FFEDF0CAE94 cmp eax,r8d
00007FFEDF0CAE97 jae 00007FFEDF0CAF0C
00007FFEDF0CAE99 movsxd r8,eax
00007FFEDF0CAE9C mov rdx,qword ptr [rdx+r8*8+10h] ; !!!
00007FFEDF0CAEA1 mov qword ptr [rsp+28h],rdx ; !!!
在C3
情况下,RyuJIT将目标字节保存在r11d
、r10d
寄存器中;在 C8
的情况下,RyuJIT 将它们保存在堆栈中 (qword ptr [rsp+28h]
)。解释:当前版本的 RyuJIT(在我的例子中是 v4.6.1080.0)对不超过 4 个字段的结构进行标量替换(参见 coreclr#6839)。因此,C5
、C6
、C7
和 C8
的 RyuJIT 性能比 C2
、C3
、[=33= 的性能差].注意:此行为可能会在 RyuJIT 的未来版本中更改。