找到一个点属于哪个三角形的优化技巧
Optimisation tips to find in which triangle a point belongs
实际上我在优化我的算法时遇到了一些麻烦:
我有一个圆盘(以 0 为中心,半径为 1)充满三角形(不一定相同 area/length)。可能有大量的三角形(假设从 1k
到 300k
个三角形)
我的目标是尽快找到一个点属于哪个三角形。
操作必须重复大量时间(大约10k
次)。
目前我使用的算法是:计算每个三角形中点的重心坐标。如果第一个系数在 0 和 1 之间,我继续。如果不是,我就停下来。然后我用同样的想法计算第二个系数,第三个,我对每个三角形都这样做。
我想不出一种方法来利用我在光盘上工作的事实(以及我有欧几里得距离的事实来帮助我 "target" 直接得到好的三角形):如果我尝试计算从我的点到每个 "center" 个三角形的距离:
1)它已经比我用重心坐标暴力破解时做的操作多了
2) 我将不得不订购一个向量,其中包含所有三角形到我的点的欧几里德距离。
3)我完全不能保证离我点最近的三角形就是好的三角形。
我觉得我错过了一些东西,我可以预先计算一些东西来帮助我在开始蛮力部分之前发现好的东西"area"。
该算法已经并行化(使用 OpenMP):我正在并行调用以下函数。
bool Triangle2D::is_in_triangle(Vector2d Point) {
double denominator = ((Tri2D(1, 1) - Tri2D(2, 1))*(Tri2D(0, 0) - Tri2D(2, 0)) + (Tri2D(2, 0) - Tri2D(1, 0))*(Tri2D(0, 1) - Tri2D(2, 1)));
// Computing the first coefficient
double a = ((Tri2D(1, 1) - Tri2D(2, 1))*(Point(0) - Tri2D(2, 0)) + (Tri2D(2, 0) - Tri2D(1, 0))*(Point(1) - Tri2D(2, 1))) / denominator;
if (a < 0 || a>1) {
return(false);
}
// Computing the second coefficient
double b = ((Tri2D(2, 1) - Tri2D(0, 1))*(Point(0) - Tri2D(2, 0)) + (Tri2D(0, 0) - Tri2D(2, 0))*(Point(1) - Tri2D(2, 1))) / denominator;
if (b < 0 || b>1) {
return(false);
}
// Computing the third coefficient
double c = 1 - a - b;
if (c < 0 || c>1) {
return(false);
}
return(true);
}
下一步可能是研究 GPU 并行化,但我需要确保代码背后的想法足够好。
目前 75k
个三角形和 10k
个点大约需要 2min30
,但这还不够快。
编辑:
Triangle2D
使用特征矩阵存储坐标
我想你可以为每个三角形制作一个边界数组:顶部、底部、右侧和左侧极端。然后,将您的观点与这些边界进行比较。如果它落在一个内,那么看看它是否真的在三角形内。这样,99.9% 的情况不涉及双 floating-point 乘法和一些加法——只是比较。仅当该点位于三角形的直线极值内时才进行计算量大的操作。
这可以进一步加快,例如例如排序三角形的最顶端,并使用二进制搜索;然后首先找到三角形下方的最高点,然后检查三角形上方的最高点。这样,您只需检查一半多一点。如果三角形极值的高度有一个上限,你可以检查得少得多。请注意,后一种策略会使您的源代码更加复杂 - 因此这将决定您希望为优化多少结果投入多少努力。第一部分似乎相当简单,而且帮助很大。排序列表:只需付出几乎一半的操作即可付出更多努力。我先看看第一种策略是否适合你。
使用 boost 的二次 rtree 找到最近的三角形完美地完成了这项工作。该算法现在不到一分钟就 运行(对于 75k 三角形和 100k 点(多 10 倍!))
我最终通过在每个三角形中放入一个盒子来构建一棵树,并寻找一个点的 Knn 并测试相应的三角形:)
预计进入像空间数据库这样的新领域会遇到更多麻烦,但 Boost 确实是一个疯狂的库
所有 long-bearded HPC-professionals,请允许在这里使用一些学术性的详细方法,这可能(在我看来)变得有趣,如果不是的话对于我们的社区成员来说,他们觉得自己比您在专业上感觉自己更年轻,他们可能会对更深入地了解 performance-motivated code-design、性能调整和其他 parallel-code 风险和收益,你自己知道 hard-core HPC-computing 体验得那么好,那么深。谢谢。
a) 算法 (as-is) 可以获得 ~2X 加速 a low-hanging fruit+more surprise yet2come
b) 由于 2geometry
,其他算法可能会获得 ~40~80 倍的加速提升
c) 获得最佳并行代码 + 终极性能的技巧
GOAL:10k
点的目标运行时间300k
三角形 在配备 i5 8500、3GHz、6 核、NVIDIA Quadro P400 的计算机上将是 2-3min(必须尝试 GPU 计算,甚至不确定是否值得)
虽然这似乎是一段漫长的旅程,但问题很好,值得仔细研究,所以请在 utmost-performance 积极思考的过程中耐心等待。
a) 算法 (as-is) 分析:
重心坐标系的as-is使用是一个很好的技巧,在最好的情况下,直接实现它的成本略高于 (20 FLOPs + 16 MEM/REG-I/O-ops),略高于 ( 30 次 FLOPs + 30 MEM/REG-I/O-ops).
有一些润色,可以通过避免一些昂贵甚至不重要的操作的发生来降低这些执行成本:
--------------------------------------------------------------------------------------
double denominator = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB
) * ( Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
double a = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB
) * ( Point(0) //------------------------ + OP-A.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-9.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB
) * ( Point(1) //-------------- + OP-D.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-C.MUL
)
) / denominator; //-------------------------- 1x REG + OP-F.DIV //----------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever------------[3]
if (a < 0 || a>1) { // ----------------------------------------------------------------------------- a < 0 ~~ ( sign( a ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ a > 1 ~~ || a > denominator
}
// Computing the second coefficient
double b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //--------- 2x MEM + OP-16.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-17.SUB + OP-18.MUL
+ ( Tri2D( 0, 0 ) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD
* ( Point(1) - Tri2D( 2, 1 ) ) //--------- 2x MEM + OP-20.SUB + OP-21.MUL
) / denominator; //-------------------------- 1x REG + OP-23.DIV //---------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever -----------[3]
if (b < 0 || b>1) { // ----------------------------------------------------------------------------- b < 0 ~~ ( sign( b ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ b > 1 ~~ || b > denominator
}
// Computing the third coefficient
double c = 1 - a - b; // ------------------------------------------- 2x REG + OP-24.SUB + OP-25.SUB
// 1 -(a - b)/denominator; //--------------------------------------------------------------- MAY DEFER THE MOST EXPENSIVE DIVISION EXECUTED BUT HERE, IFF INDEED FIRST NEEDED <---HERE <----------[3]
repeated re-evaluations,出现在原文中的内容可能会被手动明确制作出来 assign/re-use,但是,一个好的优化编译器有可能将这些内容逐出使用 -O3
强制优化标志。
do not hesitate to profile连这个lowest-hanging水果,把最贵的部分打磨一下
//------------------------------------------------------------------
double Tri2D_11_sub_21 = ( Tri2D( 1, 1 )
- Tri2D( 2, 1 )
), //====================================================== 2x MEM + OP-a.SUB (REG re-used 2x)
Tri2D_20_sub_10 = ( Tri2D( 2, 0 )
- Tri2D( 1, 0 )
), //====================================================== 2x MEM + OP-b.SUB (REG re-used 2x)
Tri2D_00_sub_20 = ( Tri2D( 0, 0 )
- Tri2D( 2, 0 )
); //====================================================== 2x MEM + OP-c.SUB (REG re-used 1~2x)
//-----------------------
double denominator = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-d.MUL
) * ( /*
Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB (avoided by re-use) */
Tri2D_00_sub_20 //===================================== 1x REG + OP-f.ADD
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-e.MUL
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
//
double enumer_of_a = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-g.MUL
) * ( Point(0) //------------------------------------------ + OP-i.MUL
- Tri2D( 2, 0 ) //--------------------------------------- 2x MEM + OP-h.SUB
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-l.ADD
) * ( Point(1) //-------------------------------- + OP-k.MUL
- Tri2D( 2, 1 ) //--------------------------- 2x MEM + OP-j.MUL
)
);/*denominator; *///------------------------ 1x REG + OP-F.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A CHEAPEST EVER J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_a > denominator // in a > 1, THE SIZE DECIDES, the a / denominator > 1, iff enumer_of_a > denominator a rather expensive .FDIV is avoided at all
|| enumer_of_a * denominator < 0 ) return(false); // in a < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the second coefficient
//
double enumer_of_b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //---------------------------------------- 2x MEM + OP-m.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //---------------------------------------- 2x MEM + OP-n.SUB + OP-o.MUL
+ ( /*
Tri2D( 0, 0 ) - Tri2D( 2, 0 ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD (avoided by re-use) */
Tri2D_00_sub_20 //======================================================== 1x REG + OP-p.ADD
)
* ( Point(1) - Tri2D( 2, 1 ) ) //---------------------------------------- 2x MEM + OP-r.SUB + OP-q.MUL
);/*denominator; *///------------------------ 1x REG + OP-23.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A 2nd CHEAPEST J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_b > denominator // in b > 1, THE SIZE DECIDES, the a / denominator > 1, iff enumer_of_a > denominator a rather expensive .FDIV is avoided at all
|| enumer_of_b * denominator < 0 ) return(false); // in b < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the third coefficient
//
double c = 1 - ( ( enumer_of_a
- enumer_of_b
)
/ denominator
); // --------------------------------------------- 3x REG + OP-s.SUB + OP-t.FDIC + OP-u.SUB <----THE MOST EXPENSIVE .FDIV BUT HERE, IFF INDEED FIRST NEEDED <---HERE <------------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR PRE-FINAL RET J.I.T./non-MISRA-C-RET-->
//
if ( c < 0 || c > 1 ) return( false );
return( true ); //~~~~~~~~~~~~~~~~~~~ "the-last-resort" RET-->
b) 算法的其他方法:
让我们回顾一下另一种方法,这种方法似乎既更快又更便宜,因为数据更少,指令更少,而且一旦智能使用矢量化 AVX-2 或更好的 AVX-512,它也有望具有高密度vector-instructions 将针对每个核心进行利用:ANIMATED, fully-INTERACTIVE explanation is altogether with analytical problem re-formulation here.
triple-test 的 point-to-line 距离(每条线是 ax + by + c = 0
)的成本很低~ 2 FMA3
就足够了 + 符号测试(如果 vector-4-in-1-compact AVX-2 / 8-in-1 AVX-512 VFMADD
-s 甚至更好)
虽然可以 "fast" 决定针对相应的三角形测试一个点是否有意义,但可能需要 中的每个三角形静态预 "framing"polar(R,Theta)
坐标 space 由 ( R_min, R_max, Theta_min, Theta_max )
和 "fast" 的静态预计算元组区分每个点,如果它不适合这样的 polar-segment .然而,这样做的成本(data-access 模式成本 + 这些 "fast" 指令的成本)将超出任何潜在的 "saved" instruction-paths,这不需要发生(如果point 位于 polar-segment 之外)。在达到 24 points-in-triangle 测试的性能范围后,每 6 CPU-cores @ 3.0+ GHz 的成本约为 9 CPU-instructions,polar-segment "pre-testing" 将突然变得非常昂贵,而不是谈论二阶负面影响(通过更糟糕的 cache-hit / cache-miss 比率引入,因为要存储更多数据并将其读入 "fast"-pre-test ~ +16B per a triangle "framing" polar-segment tuple +8B per point (对缓存的影响最差hit/miss-ratio).
这显然不是任何进一步行动的好方向,因为性能会下降,而不是上升,这是我们这里的全球战略。
Intel i5 8500 CPU 可以使用但 AVX-2,因此如果需要,每个内核每个 CPU-clock 滴答 8 个三角形的最紧凑使用可以实现甚至高出 2 倍的性能。
TRIPLE-"point-above-line"-TEST per POINT per TRIANGLE:
---------------------------------------------------------------------------------
PRE-COMPUTE STATIC per TRIANGLE CONSTANTS:
LINE_1: C0__L1, C1__L1, C2__L1, bool_L1DistanceMustBePOSITIVE
LINE_2: C0__L2, C1__L2, C2__L2, bool_L2DistanceMustBePOSITIVE
LINE_3: C0__L3, C1__L3, C2__L3, bool_L3DistanceMustBePOSITIVE
TEST per TRIANGLE per POINT (Px,Py) - best executed in an AVX-vectorised fashion
LINE_1_______________________________________________________________
C0__L1 ( == pre-calc'd CONST = c1 / sqrt( a1^2 + b1^2 ) ) //
Px * C1__L1 ( == pre-calc'd CONST = a1 / sqrt( a1^2 + b1^2 ) ) // OP-1.FMA REG-Px,C1__L1,C0__L1
Py * C2__L1 ( == pre-calc'd CONST = b1 / sqrt( a1^2 + b1^2 ) ) // OP-2.FMA REG-Py,C2__L1, +
.GT./.LT. 0 // OP-3.SIG
LINE_2_______________________________________________________________
C0__L2 ( == pre-calc'd CONST = c2 / sqrt( a2^2 + b2^2 ) ) //
Px * C1__L2 ( == pre-calc'd CONST = a2 / sqrt( a2^2 + b2^2 ) ) // OP-4.FMA REG-Px,C1__L2,C0__L2
Py * C2__L2 ( == pre-calc'd CONST = b2 / sqrt( a2^2 + b2^2 ) ) // OP-5.FMA REG-Py,C2__L2, +
.GT./.LT. 0 // OP-6.SIG
LINE_3_______________________________________________________________
C0__L3 ( == pre-calc'd CONST = c3 / sqrt( a3^2 + b3^2 ) ) //
Px * C1__L3 ( == pre-calc'd CONST = a3 / sqrt( a3^2 + b3^2 ) ) // OP-7.FMA REG-Px,C1__L3,C0__L3
Py * C2__L3 ( == pre-calc'd CONST = b3 / sqrt( a3^2 + b3^2 ) ) // OP-8.FMA REG-Py,C2__L3, +
.GT./.LT. 0 // OP-9.SIG
( using AVX-2 intrinsics or inlined assembler will deliver highest performance due to COMPACT 4-in-1 VFMADDs )
____________________________________________
| __________________________________________triangle A: C1__L1
| | ________________________________________triangle B: C1__L1
| | | ______________________________________triangle C: C1__L1
| | | | ____________________________________triandle D: C1__L1
| | | | |
| | | | | ______________________________
| | | | | | ____________________________triangle A: Px
| | | | | | | __________________________triangle B: Px
| | | | | | | | ________________________triangle C: Px
| | | | | | | | | ______________________triandle D: Px
| | | | | | | | | |
|1|2|3|4| | | | | |
| | | | | |1|2|3|4| ________________
| | | | | | | | | | | ______________triangle A: C0__L1
| | | | | | | | | | | | ____________triangle B: C0__L1
| | | | | | | | | | | | | __________triangle C: C0__L1
| | | | | | | | | | | | | | ________triandle D: C0__L1
| | | | | | | | | | | | | | |
|1|2|3|4| | | | | | | | | | |
| | | | | |1|2|3|4| | | | | |
| | | | | | | | | | |1|2|3|4|
(__m256d) __builtin_ia32_vfmaddpd256 ( (__v4df )__A, (__v4df )__B, (__v4df )__C ) ~/ per CPU-core @ 3.0 GHz ( for actual uops durations check Agner or Intel CPU documentation )
can
perform 4-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
24-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU
using AVX-512 empowered CPU, can use 8-in-1 VFMADDs
could
perform 8-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
48-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU
c) 获得最佳并行代码 + 终极性能的技巧:
步骤 -1: GPU / CUDA 成本 v/s 收益
如果您的 PhD-mentor、教授、老板或项目经理确实坚持要您为这个问题开发一个 GPU-computing c++/CUDA 代码解决方案,那么下一步最好是要求获得任何比您发布的任务更适合 GPU-card 这样的任务。
您指定的卡,Q400 GPU has just 2 SMX (48KB L1-cache each) 不太适合真正意义上的并行 CUDA-computing,实际执行任何 SIMD-thread-block 计算的处理设备少了大约 30 倍,更不用说它的小内存和微小的 L1 / L2 on-SMX-caches。因此,在所有 CUDA-related 设计和优化工作之后,GPU-SMX warp32
-wide thread-block 中的执行不会更多,而是一对(!) =228=],所以这个基于 GP107 的设备没有大马戏团,并且有 30+X-better 个配备了一些确实高性能 parallel-processing 可用 COTS 的设备)
第 0 步: pre-compute 和 pre-arrange 数据(MAXIMISE cache-line COHERENCY):
在这里,优化算法的最佳宏观循环是有意义的,这样您 "benefit" 大部分来自缓存命中(即 "fast" 数据的最佳 re-use,pre-fetched 已经 )
因此,可以测试一下,如果将工作分配到 N-concurrent-workers 上 cache-wise 更快,谁处理三角形的分离 working-blocks,其中每个循环遍历最小的 memory-area --- 所有点(~ 10k * 2 * 4B ~ 80 kB),然后进入 working-block 中的下一个三角形。确保 row-first 数组对齐到 memory-area 是至关重要的(对于 HPC-fast compact/vectorised vector-algebra 和 matrix-processing )
好处是什么?
缓存的系数将re-used ~ 10k次(代价是~ 0.5~1.0 [ns]
,而不是re-fetching + 100 ~ 300 [ns]
的成本(如果这些必须通过 RAM 内存访问 re-read)。差异,关于 ~ 200X ~ 600X
值得努力使工作流程 sub-ordinated 与 data-access 模式和 cache-line 资源保持最佳对齐。
结果是原子的 - 任何点都属于一个且仅一个 (non-overlapping) 个三角形。
这简化了先验 non-colliding 并相对稀疏地写入结果向量,任何检测到的向量 point-inside-triangle 都可以自由报告找到的此类三角形的索引。
使用此结果向量潜在地避免任何 re-testing 点,其中已经执行了匹配(索引为 non-negative )不是很有效,因为 [=248 的成本=]ing 此类指示和 re-arranging 4 合 1 或 8 合 1 point-in-triangle 测试的紧凑对齐,对于潜在 "savings" 而非 re-testing 一个已经映射的点。
所以 10k
点在一块 300k
三角形上可能会产生一个工作流:
拆分一些 300k / 6
个核心
~ 50k
个三角形/1 个核心
~ 50k * 10 k
point-in-triangle 每个核心测试
~ 500M
point-in-triangle 每个核心测试 @ 3.0+ GHz
~ 125M AVX-2
vector-4-in -1-compact test-executions per-core
~ 125M
tests x 10 uops
-instructions @ 3.0 GHz
.. .
哪个是1.25G uops @ 3.0+ GHz
...秒?是的,有一种非常强烈的动机要走到这里,朝着最终的表现前进,并以这种方式指导进一步的工作。
所以,我们在这里:
在 3.0+ GHz 的 6 核 AVX-2 i5 8500 上,300+k 三角形/10+k 点的主要可实现 HPC-target 在几秒的范围内
值得努力,不是吗?
Quad trees are nice if you have to do lots of single queries, but if you've got 10k to do all at once there is an algorithm purpose-built for this: The sweep line. I'm getting query times of less than 5 seconds on similar data in a single thread.
We're going to take a vertical line and sweep it from left to right across your 2d plane, and we're going to keep track of which triangles that line is intersecting at any point in time.
So, whenever your line intersects one of your query points as it's sweeping, you only ever have to check the triangles that your vertical line is overlapping. No other triangles can contain that point.
I kind of hate the name "sweepline" in this case because it gives the mental image of smoothly traversing the plane, which it doesn't do. It just jumps to the next position of interest in order from left to right.
Depending on the ratio of triangles to queries, you could also put the overlapping triangles in a segment tree or interval tree by their y-coordinates, but this was actually slower for my data (or my implementation was bad. Also possible). Definitely worth a try though, especially if you can put off balancing the tree until its needed.
The setup time is just 3 sorts:
- Get a vector of pointers to your triangles sorted by the minimum x-value of the tri
- Get a vector of pointers to your triangles sorted by the maximum x-value of the tri
- Get a vector of pointers to your points sorted by their x-value
Then we sweep:
- Initialize the "current positions" for each of your 3 sorted lists to 0
- Look for the lowest x-value at the current position in your 3 sorted lists. This will be the next x-position of your sweepline
- If it's the leftmost point of a triangle, then add that triangle to your "overlapping" vector
- If it's a query point, then check that point against all the currently overlapping tris
- If it's the rightmost point of a triangle, then remove it from the overlapping vector.
- increment the current position for that list
- Loop until you're out of points, or out of tris.
Why is this better than a quadtree? Because we're always keeping track of the tris currently intersecting our sweepline. This means we're re-using data between queries, and quadtrees don't really do that. Quadtrees will have much better single-query performance, but this kind of bulk lookup is made for the sweepline.
实际上我在优化我的算法时遇到了一些麻烦:
我有一个圆盘(以 0 为中心,半径为 1)充满三角形(不一定相同 area/length)。可能有大量的三角形(假设从 1k
到 300k
个三角形)
我的目标是尽快找到一个点属于哪个三角形。
操作必须重复大量时间(大约10k
次)。
目前我使用的算法是:计算每个三角形中点的重心坐标。如果第一个系数在 0 和 1 之间,我继续。如果不是,我就停下来。然后我用同样的想法计算第二个系数,第三个,我对每个三角形都这样做。
我想不出一种方法来利用我在光盘上工作的事实(以及我有欧几里得距离的事实来帮助我 "target" 直接得到好的三角形):如果我尝试计算从我的点到每个 "center" 个三角形的距离:
1)它已经比我用重心坐标暴力破解时做的操作多了
2) 我将不得不订购一个向量,其中包含所有三角形到我的点的欧几里德距离。
3)我完全不能保证离我点最近的三角形就是好的三角形。
我觉得我错过了一些东西,我可以预先计算一些东西来帮助我在开始蛮力部分之前发现好的东西"area"。
该算法已经并行化(使用 OpenMP):我正在并行调用以下函数。
bool Triangle2D::is_in_triangle(Vector2d Point) {
double denominator = ((Tri2D(1, 1) - Tri2D(2, 1))*(Tri2D(0, 0) - Tri2D(2, 0)) + (Tri2D(2, 0) - Tri2D(1, 0))*(Tri2D(0, 1) - Tri2D(2, 1)));
// Computing the first coefficient
double a = ((Tri2D(1, 1) - Tri2D(2, 1))*(Point(0) - Tri2D(2, 0)) + (Tri2D(2, 0) - Tri2D(1, 0))*(Point(1) - Tri2D(2, 1))) / denominator;
if (a < 0 || a>1) {
return(false);
}
// Computing the second coefficient
double b = ((Tri2D(2, 1) - Tri2D(0, 1))*(Point(0) - Tri2D(2, 0)) + (Tri2D(0, 0) - Tri2D(2, 0))*(Point(1) - Tri2D(2, 1))) / denominator;
if (b < 0 || b>1) {
return(false);
}
// Computing the third coefficient
double c = 1 - a - b;
if (c < 0 || c>1) {
return(false);
}
return(true);
}
下一步可能是研究 GPU 并行化,但我需要确保代码背后的想法足够好。
目前 75k
个三角形和 10k
个点大约需要 2min30
,但这还不够快。
编辑:Triangle2D
使用特征矩阵存储坐标
我想你可以为每个三角形制作一个边界数组:顶部、底部、右侧和左侧极端。然后,将您的观点与这些边界进行比较。如果它落在一个内,那么看看它是否真的在三角形内。这样,99.9% 的情况不涉及双 floating-point 乘法和一些加法——只是比较。仅当该点位于三角形的直线极值内时才进行计算量大的操作。
这可以进一步加快,例如例如排序三角形的最顶端,并使用二进制搜索;然后首先找到三角形下方的最高点,然后检查三角形上方的最高点。这样,您只需检查一半多一点。如果三角形极值的高度有一个上限,你可以检查得少得多。请注意,后一种策略会使您的源代码更加复杂 - 因此这将决定您希望为优化多少结果投入多少努力。第一部分似乎相当简单,而且帮助很大。排序列表:只需付出几乎一半的操作即可付出更多努力。我先看看第一种策略是否适合你。
使用 boost 的二次 rtree 找到最近的三角形完美地完成了这项工作。该算法现在不到一分钟就 运行(对于 75k 三角形和 100k 点(多 10 倍!))
我最终通过在每个三角形中放入一个盒子来构建一棵树,并寻找一个点的 Knn 并测试相应的三角形:) 预计进入像空间数据库这样的新领域会遇到更多麻烦,但 Boost 确实是一个疯狂的库
所有 long-bearded HPC-professionals,请允许在这里使用一些学术性的详细方法,这可能(在我看来)变得有趣,如果不是的话对于我们的社区成员来说,他们觉得自己比您在专业上感觉自己更年轻,他们可能会对更深入地了解 performance-motivated code-design、性能调整和其他 parallel-code 风险和收益,你自己知道 hard-core HPC-computing 体验得那么好,那么深。谢谢。
a) 算法 (as-is) 可以获得 ~2X 加速 a low-hanging fruit+more surprise yet2come
b) 由于 2geometry
,其他算法可能会获得 ~40~80 倍的加速提升c) 获得最佳并行代码 + 终极性能的技巧
GOAL:10k
点的目标运行时间300k
三角形 在配备 i5 8500、3GHz、6 核、NVIDIA Quadro P400 的计算机上将是 2-3min(必须尝试 GPU 计算,甚至不确定是否值得)
虽然这似乎是一段漫长的旅程,但问题很好,值得仔细研究,所以请在 utmost-performance 积极思考的过程中耐心等待。
a) 算法 (as-is) 分析:
重心坐标系的as-is使用是一个很好的技巧,在最好的情况下,直接实现它的成本略高于 (20 FLOPs + 16 MEM/REG-I/O-ops),略高于 ( 30 次 FLOPs + 30 MEM/REG-I/O-ops).
有一些润色,可以通过避免一些昂贵甚至不重要的操作的发生来降低这些执行成本:
--------------------------------------------------------------------------------------
double denominator = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB
) * ( Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
double a = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB
) * ( Point(0) //------------------------ + OP-A.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-9.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB
) * ( Point(1) //-------------- + OP-D.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-C.MUL
)
) / denominator; //-------------------------- 1x REG + OP-F.DIV //----------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever------------[3]
if (a < 0 || a>1) { // ----------------------------------------------------------------------------- a < 0 ~~ ( sign( a ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ a > 1 ~~ || a > denominator
}
// Computing the second coefficient
double b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //--------- 2x MEM + OP-16.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-17.SUB + OP-18.MUL
+ ( Tri2D( 0, 0 ) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD
* ( Point(1) - Tri2D( 2, 1 ) ) //--------- 2x MEM + OP-20.SUB + OP-21.MUL
) / denominator; //-------------------------- 1x REG + OP-23.DIV //---------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever -----------[3]
if (b < 0 || b>1) { // ----------------------------------------------------------------------------- b < 0 ~~ ( sign( b ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ b > 1 ~~ || b > denominator
}
// Computing the third coefficient
double c = 1 - a - b; // ------------------------------------------- 2x REG + OP-24.SUB + OP-25.SUB
// 1 -(a - b)/denominator; //--------------------------------------------------------------- MAY DEFER THE MOST EXPENSIVE DIVISION EXECUTED BUT HERE, IFF INDEED FIRST NEEDED <---HERE <----------[3]
repeated re-evaluations,出现在原文中的内容可能会被手动明确制作出来 assign/re-use,但是,一个好的优化编译器有可能将这些内容逐出使用
-O3
强制优化标志。do not hesitate to profile连这个lowest-hanging水果,把最贵的部分打磨一下
//------------------------------------------------------------------
double Tri2D_11_sub_21 = ( Tri2D( 1, 1 )
- Tri2D( 2, 1 )
), //====================================================== 2x MEM + OP-a.SUB (REG re-used 2x)
Tri2D_20_sub_10 = ( Tri2D( 2, 0 )
- Tri2D( 1, 0 )
), //====================================================== 2x MEM + OP-b.SUB (REG re-used 2x)
Tri2D_00_sub_20 = ( Tri2D( 0, 0 )
- Tri2D( 2, 0 )
); //====================================================== 2x MEM + OP-c.SUB (REG re-used 1~2x)
//-----------------------
double denominator = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-d.MUL
) * ( /*
Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB (avoided by re-use) */
Tri2D_00_sub_20 //===================================== 1x REG + OP-f.ADD
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-e.MUL
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
//
double enumer_of_a = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-g.MUL
) * ( Point(0) //------------------------------------------ + OP-i.MUL
- Tri2D( 2, 0 ) //--------------------------------------- 2x MEM + OP-h.SUB
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-l.ADD
) * ( Point(1) //-------------------------------- + OP-k.MUL
- Tri2D( 2, 1 ) //--------------------------- 2x MEM + OP-j.MUL
)
);/*denominator; *///------------------------ 1x REG + OP-F.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A CHEAPEST EVER J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_a > denominator // in a > 1, THE SIZE DECIDES, the a / denominator > 1, iff enumer_of_a > denominator a rather expensive .FDIV is avoided at all
|| enumer_of_a * denominator < 0 ) return(false); // in a < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the second coefficient
//
double enumer_of_b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //---------------------------------------- 2x MEM + OP-m.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //---------------------------------------- 2x MEM + OP-n.SUB + OP-o.MUL
+ ( /*
Tri2D( 0, 0 ) - Tri2D( 2, 0 ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD (avoided by re-use) */
Tri2D_00_sub_20 //======================================================== 1x REG + OP-p.ADD
)
* ( Point(1) - Tri2D( 2, 1 ) ) //---------------------------------------- 2x MEM + OP-r.SUB + OP-q.MUL
);/*denominator; *///------------------------ 1x REG + OP-23.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A 2nd CHEAPEST J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_b > denominator // in b > 1, THE SIZE DECIDES, the a / denominator > 1, iff enumer_of_a > denominator a rather expensive .FDIV is avoided at all
|| enumer_of_b * denominator < 0 ) return(false); // in b < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the third coefficient
//
double c = 1 - ( ( enumer_of_a
- enumer_of_b
)
/ denominator
); // --------------------------------------------- 3x REG + OP-s.SUB + OP-t.FDIC + OP-u.SUB <----THE MOST EXPENSIVE .FDIV BUT HERE, IFF INDEED FIRST NEEDED <---HERE <------------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR PRE-FINAL RET J.I.T./non-MISRA-C-RET-->
//
if ( c < 0 || c > 1 ) return( false );
return( true ); //~~~~~~~~~~~~~~~~~~~ "the-last-resort" RET-->
b) 算法的其他方法:
让我们回顾一下另一种方法,这种方法似乎既更快又更便宜,因为数据更少,指令更少,而且一旦智能使用矢量化 AVX-2 或更好的 AVX-512,它也有望具有高密度vector-instructions 将针对每个核心进行利用:ANIMATED, fully-INTERACTIVE explanation is altogether with analytical problem re-formulation here.
triple-test 的 point-to-line 距离(每条线是 ax + by + c = 0
)的成本很低~ 2 FMA3
就足够了 + 符号测试(如果 vector-4-in-1-compact AVX-2 / 8-in-1 AVX-512 VFMADD
-s 甚至更好)
虽然可以 "fast" 决定针对相应的三角形测试一个点是否有意义,但可能需要 中的每个三角形静态预 "framing"polar(R,Theta)
坐标 space 由 ( R_min, R_max, Theta_min, Theta_max )
和 "fast" 的静态预计算元组区分每个点,如果它不适合这样的 polar-segment .然而,这样做的成本(data-access 模式成本 + 这些 "fast" 指令的成本)将超出任何潜在的 "saved" instruction-paths,这不需要发生(如果point 位于 polar-segment 之外)。在达到 24 points-in-triangle 测试的性能范围后,每 6 CPU-cores @ 3.0+ GHz 的成本约为 9 CPU-instructions,polar-segment "pre-testing" 将突然变得非常昂贵,而不是谈论二阶负面影响(通过更糟糕的 cache-hit / cache-miss 比率引入,因为要存储更多数据并将其读入 "fast"-pre-test ~ +16B per a triangle "framing" polar-segment tuple +8B per point (对缓存的影响最差hit/miss-ratio).
这显然不是任何进一步行动的好方向,因为性能会下降,而不是上升,这是我们这里的全球战略。
Intel i5 8500 CPU 可以使用但 AVX-2,因此如果需要,每个内核每个 CPU-clock 滴答 8 个三角形的最紧凑使用可以实现甚至高出 2 倍的性能。
TRIPLE-"point-above-line"-TEST per POINT per TRIANGLE:
---------------------------------------------------------------------------------
PRE-COMPUTE STATIC per TRIANGLE CONSTANTS:
LINE_1: C0__L1, C1__L1, C2__L1, bool_L1DistanceMustBePOSITIVE
LINE_2: C0__L2, C1__L2, C2__L2, bool_L2DistanceMustBePOSITIVE
LINE_3: C0__L3, C1__L3, C2__L3, bool_L3DistanceMustBePOSITIVE
TEST per TRIANGLE per POINT (Px,Py) - best executed in an AVX-vectorised fashion
LINE_1_______________________________________________________________
C0__L1 ( == pre-calc'd CONST = c1 / sqrt( a1^2 + b1^2 ) ) //
Px * C1__L1 ( == pre-calc'd CONST = a1 / sqrt( a1^2 + b1^2 ) ) // OP-1.FMA REG-Px,C1__L1,C0__L1
Py * C2__L1 ( == pre-calc'd CONST = b1 / sqrt( a1^2 + b1^2 ) ) // OP-2.FMA REG-Py,C2__L1, +
.GT./.LT. 0 // OP-3.SIG
LINE_2_______________________________________________________________
C0__L2 ( == pre-calc'd CONST = c2 / sqrt( a2^2 + b2^2 ) ) //
Px * C1__L2 ( == pre-calc'd CONST = a2 / sqrt( a2^2 + b2^2 ) ) // OP-4.FMA REG-Px,C1__L2,C0__L2
Py * C2__L2 ( == pre-calc'd CONST = b2 / sqrt( a2^2 + b2^2 ) ) // OP-5.FMA REG-Py,C2__L2, +
.GT./.LT. 0 // OP-6.SIG
LINE_3_______________________________________________________________
C0__L3 ( == pre-calc'd CONST = c3 / sqrt( a3^2 + b3^2 ) ) //
Px * C1__L3 ( == pre-calc'd CONST = a3 / sqrt( a3^2 + b3^2 ) ) // OP-7.FMA REG-Px,C1__L3,C0__L3
Py * C2__L3 ( == pre-calc'd CONST = b3 / sqrt( a3^2 + b3^2 ) ) // OP-8.FMA REG-Py,C2__L3, +
.GT./.LT. 0 // OP-9.SIG
( using AVX-2 intrinsics or inlined assembler will deliver highest performance due to COMPACT 4-in-1 VFMADDs )
____________________________________________
| __________________________________________triangle A: C1__L1
| | ________________________________________triangle B: C1__L1
| | | ______________________________________triangle C: C1__L1
| | | | ____________________________________triandle D: C1__L1
| | | | |
| | | | | ______________________________
| | | | | | ____________________________triangle A: Px
| | | | | | | __________________________triangle B: Px
| | | | | | | | ________________________triangle C: Px
| | | | | | | | | ______________________triandle D: Px
| | | | | | | | | |
|1|2|3|4| | | | | |
| | | | | |1|2|3|4| ________________
| | | | | | | | | | | ______________triangle A: C0__L1
| | | | | | | | | | | | ____________triangle B: C0__L1
| | | | | | | | | | | | | __________triangle C: C0__L1
| | | | | | | | | | | | | | ________triandle D: C0__L1
| | | | | | | | | | | | | | |
|1|2|3|4| | | | | | | | | | |
| | | | | |1|2|3|4| | | | | |
| | | | | | | | | | |1|2|3|4|
(__m256d) __builtin_ia32_vfmaddpd256 ( (__v4df )__A, (__v4df )__B, (__v4df )__C ) ~/ per CPU-core @ 3.0 GHz ( for actual uops durations check Agner or Intel CPU documentation )
can
perform 4-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
24-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU
using AVX-512 empowered CPU, can use 8-in-1 VFMADDs
could
perform 8-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
48-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU
c) 获得最佳并行代码 + 终极性能的技巧:
步骤 -1: GPU / CUDA 成本 v/s 收益
如果您的 PhD-mentor、教授、老板或项目经理确实坚持要您为这个问题开发一个 GPU-computing c++/CUDA 代码解决方案,那么下一步最好是要求获得任何比您发布的任务更适合 GPU-card 这样的任务。
您指定的卡,Q400 GPU has just 2 SMX (48KB L1-cache each) 不太适合真正意义上的并行 CUDA-computing,实际执行任何 SIMD-thread-block 计算的处理设备少了大约 30 倍,更不用说它的小内存和微小的 L1 / L2 on-SMX-caches。因此,在所有 CUDA-related 设计和优化工作之后,GPU-SMX warp32
-wide thread-block 中的执行不会更多,而是一对(!) =228=],所以这个基于 GP107 的设备没有大马戏团,并且有 30+X-better 个配备了一些确实高性能 parallel-processing 可用 COTS 的设备)
第 0 步: pre-compute 和 pre-arrange 数据(MAXIMISE cache-line COHERENCY):
在这里,优化算法的最佳宏观循环是有意义的,这样您 "benefit" 大部分来自缓存命中(即 "fast" 数据的最佳 re-use,pre-fetched 已经 )
因此,可以测试一下,如果将工作分配到 N-concurrent-workers 上 cache-wise 更快,谁处理三角形的分离 working-blocks,其中每个循环遍历最小的 memory-area --- 所有点(~ 10k * 2 * 4B ~ 80 kB),然后进入 working-block 中的下一个三角形。确保 row-first 数组对齐到 memory-area 是至关重要的(对于 HPC-fast compact/vectorised vector-algebra 和 matrix-processing )
好处是什么?
缓存的系数将re-used ~ 10k次(代价是~ 0.5~1.0 [ns]
,而不是re-fetching + 100 ~ 300 [ns]
的成本(如果这些必须通过 RAM 内存访问 re-read)。差异,关于 ~ 200X ~ 600X
值得努力使工作流程 sub-ordinated 与 data-access 模式和 cache-line 资源保持最佳对齐。
结果是原子的 - 任何点都属于一个且仅一个 (non-overlapping) 个三角形。
这简化了先验 non-colliding 并相对稀疏地写入结果向量,任何检测到的向量 point-inside-triangle 都可以自由报告找到的此类三角形的索引。
使用此结果向量潜在地避免任何 re-testing 点,其中已经执行了匹配(索引为 non-negative )不是很有效,因为 [=248 的成本=]ing 此类指示和 re-arranging 4 合 1 或 8 合 1 point-in-triangle 测试的紧凑对齐,对于潜在 "savings" 而非 re-testing 一个已经映射的点。
所以 10k
点在一块 300k
三角形上可能会产生一个工作流:
拆分一些 300k / 6
个核心 ~ 50k
个三角形/1 个核心 ~ 50k * 10 k
point-in-triangle 每个核心测试 ~ 500M
point-in-triangle 每个核心测试 @ 3.0+ GHz
~ 125M AVX-2
vector-4-in -1-compact test-executions per-core~ 125M
tests x 10 uops
-instructions @ 3.0 GHz
.. .
哪个是1.25G uops @ 3.0+ GHz
...秒?是的,有一种非常强烈的动机要走到这里,朝着最终的表现前进,并以这种方式指导进一步的工作。
所以,我们在这里:
在 3.0+ GHz 的 6 核 AVX-2 i5 8500 上,300+k 三角形/10+k 点的主要可实现 HPC-target 在几秒的范围内
值得努力,不是吗?
Quad trees are nice if you have to do lots of single queries, but if you've got 10k to do all at once there is an algorithm purpose-built for this: The sweep line. I'm getting query times of less than 5 seconds on similar data in a single thread.
We're going to take a vertical line and sweep it from left to right across your 2d plane, and we're going to keep track of which triangles that line is intersecting at any point in time. So, whenever your line intersects one of your query points as it's sweeping, you only ever have to check the triangles that your vertical line is overlapping. No other triangles can contain that point.
I kind of hate the name "sweepline" in this case because it gives the mental image of smoothly traversing the plane, which it doesn't do. It just jumps to the next position of interest in order from left to right.
Depending on the ratio of triangles to queries, you could also put the overlapping triangles in a segment tree or interval tree by their y-coordinates, but this was actually slower for my data (or my implementation was bad. Also possible). Definitely worth a try though, especially if you can put off balancing the tree until its needed.
The setup time is just 3 sorts:
- Get a vector of pointers to your triangles sorted by the minimum x-value of the tri
- Get a vector of pointers to your triangles sorted by the maximum x-value of the tri
- Get a vector of pointers to your points sorted by their x-value
Then we sweep:
- Initialize the "current positions" for each of your 3 sorted lists to 0
- Look for the lowest x-value at the current position in your 3 sorted lists. This will be the next x-position of your sweepline
- If it's the leftmost point of a triangle, then add that triangle to your "overlapping" vector
- If it's a query point, then check that point against all the currently overlapping tris
- If it's the rightmost point of a triangle, then remove it from the overlapping vector.
- increment the current position for that list
- Loop until you're out of points, or out of tris.
Why is this better than a quadtree? Because we're always keeping track of the tris currently intersecting our sweepline. This means we're re-using data between queries, and quadtrees don't really do that. Quadtrees will have much better single-query performance, but this kind of bulk lookup is made for the sweepline.