Implementing C# hardware intrinsics wrapper issue
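I'm trying to use the power of hardware intrinsics. As a test, I created a function based on Avx2 instructions and compared it with my current Vector implementation, which uses no intrinsics at all.
When I benchmarked the two functions, I was surprised to find that the intrinsics version is actually about 2 times slower. I investigated and found that the calculation itself is about 3.8 times faster, but creating the wrapper struct and returning the result is what takes most of the time.
Here is my implementation of the intrinsics method: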
public static Vector4FHW Subtract(Vector4FHW left, Vector4FHW right)
{
    if (Avx2.IsSupported)
    {
        var left1 = Vector128.Create(left.X, left.Y, left.Z, left.W);
        var right1 = Vector128.Create(right.X, right.Y, right.Z, right.W);
        var result = Avx2.Subtract(left1, right1);
        var x = result.GetElement(0);
        var y = result.GetElement(1);
        var z = result.GetElement(2);
        var w = result.GetElement(3);
        return new Vector4FHW(x, y, z, w);
    }
    return default;
}
Here is my simple implementation of the old Vector:
public static void Subtract(ref Vector3F left, ref Vector3F right, out Vector3F result)
{
    result = new Vector3F(left.X - right.X, left.Y - right.Y, left.Z - right.Z);
}
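I benchmarked with BenchmarkDotNet, calling Subtract 1 000 000 times, and these are my results:
With hardware support I get ~3170 us; without it, ~970 us.
My main question is: what am I doing wrong that makes creating a C# struct with values take so long compared with my old implementation, and/or is there any additional optimization I can do here?
UPDATE
My Vector4FHW and Vector3F are actually the same kind of struct. They look like this: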
[StructLayout(LayoutKind.Sequential)]
public struct Vector4FHW
{
    public float X;
    public float Y;
    public float Z;
    public float W;
    public Vector4FHW(float x, float y, float z, float w)
    {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
    //...
}
Here are my tests. They are also very simple:
[Benchmark]
public void SubtractBenchMarkAccelerated()
{
    for (int i = 0; i < 1000000; i++)
    {
        Vector4FHW.Subtract(new Vector4FHW(1, 20, 60, 15), new Vector4FHW(20, 48, 79, 19));
    }
}
[Benchmark]
public void SubtractBenchMark()
{
    for (int i = 0; i < 1000000; i++)
    {
        Vector4F.Subtract(new Vector4F(1, 20, 60, 15), new Vector4F(20, 48, 79, 19));
    }
}
Here is how to do a single operation on 3 + 3 doubles:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D Substract(Vector3D left, Vector3D right)
{
    Vector256<double> v0 = Vector256.Create(left.X, left.Y, left.Z, 0);
    Vector256<double> v1 = Vector256.Create(right.X, right.Y, right.Z, 0);
    Vector256<double> result = Avx.Subtract(v0, v1);
    return new Vector3D(result.GetElement(0), result.GetElement(1), result.GetElement(2));
}
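MethodImplOptions.AggressiveInlining tells the JIT compiler to embed the method's code into the caller's body where possible, so the generated code contains no method call, only the computation.
It may be faster, but your test has 2 problems.
- Don't check Avx2.IsSupported on every operation; check it once per application lifetime.
- Don't create the data inside the loop; building it there makes the test slow and noisy.
A clean test looks like this: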
[Benchmark]
public void SubtractBenchMarkAccelerated()
{
    Vector3D vector1 = new Vector3D(1.5, 2.5, 3.5);
    Vector3D vector2 = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        // Calls the Substract(Vector3D, Vector3D) method defined above.
        Substract(vector1, vector2);
    }
}
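But a single operation like this uses only 75% of the Vector256 capacity. Can it be 25% faster? Yes, with more data.
And that is only the beginning of the story. Suppose you want to process 4 pairs of vectors at once, 4 against 4. This is where the performance magic starts.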
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D[] SubstractArray(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);
    for (int i = 0; i < v0.Length; i++) // v0.Length = 3 here, not 4
    {
        r[i] = Avx.Subtract(v0[i], v1[i]);
    }
    return result;
}
MemoryMarshal.Cast doesn't copy anything; it just produces a Span<T> that points to the same memory as the source array, so it is lightning fast. I tested it.
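To make the "no copy" point concrete, here is a minimal sketch. The `CastDemo` class and its method names are made up for illustration, and `Vector3D` is assumed to be a sequential struct of three doubles as used elsewhere in this answer. It shows that 4 `Vector3D` values (12 doubles) are seen as 3 `Vector256<double>` values, and that a write through the reinterpreted span shows up in the original array.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Assumed layout: three consecutive doubles, 24 bytes, no padding.
[StructLayout(LayoutKind.Sequential)]
public struct Vector3D
{
    public double X, Y, Z;
    public Vector3D(double x, double y, double z) { X = x; Y = y; Z = z; }
}

public static class CastDemo
{
    // Number of whole Vector256<double> values that fit in `count` Vector3D values.
    public static int CastLength(int count)
    {
        var array = new Vector3D[count];
        Span<Vector256<double>> view =
            MemoryMarshal.Cast<Vector3D, Vector256<double>>(array.AsSpan());
        return view.Length;
    }

    // Writes through the reinterpreted span, then reads the change from the array.
    public static double WriteThroughView()
    {
        var array = new Vector3D[4];
        Span<Vector256<double>> view =
            MemoryMarshal.Cast<Vector3D, Vector256<double>>(array.AsSpan());
        view[0] = Vector256.Create(1.0, 2.0, 3.0, 4.0);
        // The fourth double of the first Vector256 is X of the second Vector3D.
        return array[1].X;
    }

    public static void Main()
    {
        Console.WriteLine(CastLength(4));       // 96 bytes / 32 bytes per Vector256
        Console.WriteLine(WriteThroughView());  // value written via the span
    }
}
```

Note that the cast truncates: bytes that do not fill a whole `Vector256<double>` are simply left out of the resulting span.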
The test can look like this:
[Benchmark]
public void SubtractBenchMarkAccelerated4()
{
    Vector3D[] array1 = new Vector3D[4];
    array1[0] = new Vector3D(1.5, 2.5, 3.5);
    array1[1] = new Vector3D(1.5, 2.5, 3.5);
    array1[2] = new Vector3D(1.5, 2.5, 3.5);
    array1[3] = new Vector3D(1.5, 2.5, 3.5);
    Vector3D[] array2 = new Vector3D[4];
    array2[0] = new Vector3D(0.1, 0.2, 0.3);
    array2[1] = new Vector3D(0.1, 0.2, 0.3);
    array2[2] = new Vector3D(0.1, 0.2, 0.3);
    array2[3] = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        SubstractArray(array1, array2);
    }
}
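This computes 4,000,000 vectors in about the time it took to compute 1,000,000, so why not? You can process any number of vectors this way. Just make sure that the total number of doubles satisfies count % 4 == 0; otherwise the trailing doubles that do not fill a whole Vector256 are not processed.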
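If the count % 4 == 0 condition cannot be guaranteed, one option is to process the leftover doubles with a plain scalar loop. The following is only a sketch under that assumption; the `TailDemo` class is hypothetical, and the `Avx.IsSupported` guard with a full scalar fallback is my addition rather than part of the code above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

[StructLayout(LayoutKind.Sequential)]
public struct Vector3D
{
    public double X, Y, Z;
    public Vector3D(double x, double y, double z) { X = x; Y = y; Z = z; }
}

public static class TailDemo
{
    // Element-wise left - right over raw doubles; all spans must have equal length.
    public static void Subtract(ReadOnlySpan<double> left, ReadOnlySpan<double> right, Span<double> result)
    {
        if (left.Length != right.Length || left.Length != result.Length)
            throw new ArgumentException("All spans must have the same length.");

        int done = 0;
        if (Avx.IsSupported)
        {
            // Each of these spans covers only the whole 4-double blocks.
            var v0 = MemoryMarshal.Cast<double, Vector256<double>>(left);
            var v1 = MemoryMarshal.Cast<double, Vector256<double>>(right);
            var r = MemoryMarshal.Cast<double, Vector256<double>>(result);
            for (int i = 0; i < v0.Length; i++)
                r[i] = Avx.Subtract(v0[i], v1[i]);
            done = v0.Length * Vector256<double>.Count;
        }

        // Scalar loop for the remaining 0..3 doubles (or everything, without AVX).
        for (int i = done; i < left.Length; i++)
            result[i] = left[i] - right[i];
    }

    // Convenience wrapper: views the Vector3D arrays as flat double spans.
    public static Vector3D[] Subtract(Vector3D[] left, Vector3D[] right)
    {
        var result = new Vector3D[left.Length];
        Subtract(MemoryMarshal.Cast<Vector3D, double>(left.AsSpan()),
                 MemoryMarshal.Cast<Vector3D, double>(right.AsSpan()),
                 MemoryMarshal.Cast<Vector3D, double>(result.AsSpan()));
        return result;
    }
}
```

With 3 vectors (9 doubles), for example, the first 8 doubles go through AVX and the last one through the scalar loop.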
Can it be faster than the example above? Yes, but only with unsafe code.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector3D[] SubstractArrayUnsafe(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);
    fixed (Vector256<double>* vPtr0 = v0, vPtr1 = v1, rPtr = r)
    {
        Vector256<double>* endPtr0 = vPtr0 + v0.Length;
        Vector256<double>* vPos0 = vPtr0;
        Vector256<double>* vPos1 = vPtr1;
        Vector256<double>* rPos = rPtr;
        while (vPos0 < endPtr0)
        {
            *rPos = Avx.Subtract(*vPos0, *vPos1);
            vPos0++;
            vPos1++;
            rPos++;
        }
    }
    return result;
}
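You can subtract not only Vector3D[] this way, but also your Vector4D[] or double[] arrays.
Also see these useful pages: x86/x64 SIMD Instruction List (SSE to AVX512) and this one.
UPDATE
Optimizing a single operation for a wrapper struct of the same size as the vector: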
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
{
    var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
    var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
    var result = Avx.Subtract(left1, right1);
    return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
}
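The reinterpretation above is only valid because the struct and Vector256<double> have the same size (4 doubles, 32 bytes), and the method assumes AVX is available. Below is a sketch of the same idea with a size check and a scalar fallback. The `Avx.IsSupported` guard, the fallback branch and the `SizesMatch` helper are my additions for illustration; as far as I know, the JIT treats `IsSupported` as a constant, so the guard should not add a per-call cost, but measure it yourself.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

[StructLayout(LayoutKind.Sequential)]
public struct Vector4DHW
{
    public double X, Y, Z, W;
    public Vector4DHW(double x, double y, double z, double w) { X = x; Y = y; Z = z; W = w; }

    // Hypothetical helper: the Unsafe.As trick relies on both types being 32 bytes.
    public static bool SizesMatch() =>
        Unsafe.SizeOf<Vector4DHW>() == Unsafe.SizeOf<Vector256<double>>();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW Subtract(ref Vector4DHW left, ref Vector4DHW right)
    {
        if (Avx.IsSupported)
        {
            // Reinterpret each struct as a Vector256<double> instead of
            // building the vector field by field.
            Vector256<double> result = Avx.Subtract(
                Unsafe.As<Vector4DHW, Vector256<double>>(ref left),
                Unsafe.As<Vector4DHW, Vector256<double>>(ref right));
            return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
        }

        // Scalar fallback so the method is still correct without AVX.
        return new Vector4DHW(left.X - right.X, left.Y - right.Y,
                              left.Z - right.Z, left.W - right.W);
    }
}

public static class Demo
{
    public static void Main()
    {
        var a = new Vector4DHW(1.5, 2.5, 3.5, 4.5);
        var b = new Vector4DHW(0.5, 0.5, 0.5, 0.5);
        Vector4DHW r = Vector4DHW.Subtract(ref a, ref b);
        Console.WriteLine($"{r.X} {r.Y} {r.Z} {r.W}");
    }
}
```

Unlike `return default;` in the question's version, the fallback branch still returns the correct difference on hardware without AVX.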
Let's benchmark it:
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

class Program
{
    static void Main()
    {
        var summary = BenchmarkRunner.Run<MyBenchmark>();
        Console.ReadKey();
    }
}
[StructLayout(LayoutKind.Sequential)]
public struct Vector4DHW
{
    public double X;
    public double Y;
    public double Z;
    public double W;
    public Vector4DHW(double x, double y, double z, double w)
    {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}
public class MyBenchmark
{
    private Vector4DHW vector1 = new Vector4DHW(1.5, 2.5, 3.5, 4.5);
    private Vector4DHW vector2 = new Vector4DHW(0.1, 0.2, 0.3, 0.4);
    // Empty loop, measured separately to estimate the loop overhead.
    [Benchmark]
    public void Loop()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var j = i;
        }
    }
    [Benchmark]
    public void Substract()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = Substract(ref vector1, ref vector2);
        }
    }
    [Benchmark]
    public void SubstractAvx()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = SubstractAvx(ref vector1, ref vector2);
        }
    }
    // Scalar version: four separate subtractions.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
    {
        return new Vector4DHW(left.X - right.X, left.Y - right.Y, left.Z - right.Z, left.W - right.W);
    }
    // AVX version: reinterprets the structs as Vector256<double> without copying fields.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW SubstractAvx(ref Vector4DHW left, ref Vector4DHW right)
    {
        var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
        var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
        var result = Avx.Subtract(left1, right1);
        return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
    }
}
Go!
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4700HQ CPU 2.40GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
[Host] : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
| Method | Mean | Error | StdDev |
|------------- |-----------:|--------:|--------:|
| Loop | 317.6 us | 1.36 us | 1.21 us |
| Substract | 1,427.0 us | 4.14 us | 3.46 us |
| SubstractAvx | 478.0 us | 1.58 us | 1.40 us |
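In conclusion, memory optimization matters a lot when you are trying to save a few microseconds. Even stack allocation matters, lightning-fast as it is. Finally, the for loop overhead accounts for a large share of those 478 us, which is why I measured the Loop overhead separately.
Let's calculate the AVX performance gain with the loop overhead subtracted: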
1,427.0 - 317.6 = 1109.4
478.0 - 317.6 = 160.4
1109.4 / 160.4 = 6.92
AVX is almost 7 times faster.
UPDATE 2
Also test this one:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public unsafe static Vector4DHW Substract(Vector4DHW left, Vector4DHW right)
{
    var result = Avx.Subtract(*(Vector256<double>*)&left, *(Vector256<double>*)&right);
    return *(Vector4DHW*)&result;
}
@aepot Well, I took your comments into account and tried it out. Here are my results:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4DHW Subtract(Vector4DHW left, Vector4DHW right)
{
    var left1 = Vector256.Create(left.X, left.Y, left.Z, 0);
    var right1 = Vector256.Create(right.X, right.Y, right.Z, 0);
    var result = Avx2.Subtract(left1, right1);
    return new Vector4DHW(result.GetElement(0), result.GetElement(1),
        result.GetElement(2), result.GetElement(3));
}
Used this way, I get ~2470 us.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector4DHW Subtract(Vector4DHW left, Vector4DHW right)
{
    var left1 = Vector256.Create(left.X, left.Y, left.Z, left.W);
    // Fourth element must come from right, not left.
    var right1 = Vector256.Create(right.X, right.Y, right.Z, right.W);
    var result = Avx2.Subtract(left1, right1);
    double* value = stackalloc double[4];
    Avx2.Store(value, result);
    return new Vector4DHW(value[0], value[1], value[2], value[3]);
}
This variant gives me ~2089 us.
Even if I make my method void and do no conversion back at all, like this:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Subtract(Vector4DHW left, Vector4DHW right)
{
    var left1 = Vector256.Create(left.X, left.Y, left.Z, left.W);
    // Fourth element must come from right, not left.
    var right1 = Vector256.Create(right.X, right.Y, right.Z, right.W);
    var result = Avx2.Subtract(left1, right1);
}
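it gives me ~990 us.
This is the only case where the intrinsics are faster, but I need to let users see the result of the calculation.
The plain Vector4D calculation takes about 1990 - 2000 us.
So I don't see any benefit in using intrinsics for this kind of calculation. Maybe it is faster in some cases, but I think each case should be considered individually.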