有什么方法可以使下面的代码在 Eigen squaredNorm 中更快

Is there any way to makethis below code faster in Eigen squaredNorm

我有以下 Eigen C++ 代码并进行了 10 百万次 scuredNorm 计算。

有没有办法让它更robust/faster .

#include <Eigen/Core>
#include <tbb/parallel_for.h>
#include "tbb/tbb.h"
#include <mutex>
#include <opencv2/opencv.hpp>

int main(){ 

int numberOFdata = 10000008;
Eigen::MatrixXf feat = Eigen::MatrixXf::Random(numberOFdata,512);
Eigen::MatrixXf b_cmp= Eigen::MatrixXf::Random(1,512);
int count_feature = feat.rows();
std::vector<int> found_number ;
std::mutex mutex1;

for (int loop = 0 ; loop<16 ; loop++){

double start_1 = static_cast<double>(cv::getTickCount());                                                                  
tbb::affinity_partitioner ap;
tbb::parallel_for( tbb::blocked_range<int>(0,count_feature),
                       [&](tbb::blocked_range<int> r  ) 
    {
        for (int i=r.begin(); i<r.end(); ++i)
        {

        auto distance = ( feat.row(i)- b_cmp ).squaredNorm();                                   
        if (distance < 0.5) {                                                                   
                mutex1.lock();            
                found_number.push_back(i);
                mutex1.unlock();
                            
                           }
        }
    },ap);

double timefin = ((double)cv::getTickCount() - start_1) / cv::getTickFrequency();
std::cout  << count_feature   << " TOTAL : "  << timefin << std::endl;

}
}

编译标志:

-Xpreprocessor   -std=c++11  -fopenmp -pthread -O3  -mavx2 -march=native -funroll-loops -fpermissive

本征版本 3.3.7 tbb opencv 和 eigen 链接。

您可以删除 opencv 并使用不同的经过时间计算。

谢谢

如果您以与访问它相同的顺序存储 feat(即在您的情况下为 Eigen::RowMajor),您的速度应该会提高大约 4 倍。

删除所有与 Eigen 无关的东西的最小示例:

  int numberOFdata = 10000008;
  Eigen::Matrix<float,Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> feat = Eigen::MatrixXf::Random(numberOFdata, 512);
  Eigen::RowVectorXf b_cmp = Eigen::MatrixXf::Random(1, 512);
  int count_feature = feat.rows();
  std::vector<int> found_number;

  for (int loop = 0; loop < 16; loop++) {
    auto start = std::chrono::steady_clock::now();
    {
      for (int i = 0; i < feat.rows(); ++i) {
        float distance = (feat.row(i) - b_cmp).squaredNorm();
        if (distance < 0.5f) {
          found_number.push_back(i);
        }
      }
    };

    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> diff = end-start;
    std::cout  << count_feature   << " TOTAL : "  <<
     diff.count() << std::endl;
  }

Godbolt-Demo(由于内存限制,feat 的尺寸减小):https://godbolt.org/z/b6r5K4Yxv