Why is Armadillo so slow compared to a C-style array in a simple row-wise computational task?
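I am currently computing a small quantity for each value of a large matrix (millions of rows, fewer than 1000 columns), treating each row independently.
More precisely, for each value M(i,j) in row i and column j of this matrix, the quantity is [M(i,j) - mean(i,s)] / std(i,s), where s is the subset M(i,:) without element j.
In other words, s contains all values of row i except the value at column j.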
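Written out, the quantity described above can be read as follows. This is a sketch of the definition, assuming n denotes the number of columns and that the standard deviation is the sample one (divisor n - 2 for the n - 1 remaining values), which is what the C loop in the code computes; the symbol names are introduced here for illustration only.

```latex
T_{ij} = \frac{M_{ij} - \bar{M}_{i,-j}}{s_{i,-j}},
\qquad
\bar{M}_{i,-j} = \frac{1}{n-1} \sum_{k \neq j} M_{ik},
\qquad
s_{i,-j} = \sqrt{\frac{1}{n-2} \sum_{k \neq j} \left( M_{ik} - \bar{M}_{i,-j} \right)^{2}}
```

The code additionally multiplies the denominator by sqrt(1 + 1/(n-1)), which does not appear in the prose definition.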
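I compared two implementations, one with C-style arrays and one with Armadillo, and Armadillo is roughly twice as slow in terms of execution time. I would expect a similar or slightly slower execution time, but plain C arrays seem to improve performance dramatically.
Is there any particular reason for this, or something I missed somewhere? Here is an example, compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. I also tried ARMA_NO_DEBUG, without success.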
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
#include <cmath>    // sqrt
#include <cstdlib>  // malloc, free
using namespace std::chrono;
/***************************
 * main()
 ***************************/
int main( int argc, char *argv[] )
{
    unsigned nrows = 2000000; //number of rows
    unsigned ncols = 100; //number of cols
    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
    arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
    arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    //compute some normalization over each value of line of this huge matrix:
    for(unsigned i=0; i < nrows; i++) {
        const arma::rowvec current_line = huge_mat.row(i); //extract current line
        //for each observation in current_line:
        for(unsigned j=0; j < ncols; j++) {
            //Take care of side effects first:
            if( j == 0 )
                inds = current_line(arma::span(1, ncols-1));
            else
                if( j == 1 ) {
                    inds(0) = current_line(0);
                    inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
                } else
                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
            simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
        }
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "ARMADILLO: " << duration << " secs\n";
    //------------------PLAIN C Array
    double *Mat_full;
    double *output;
    unsigned int i,j,k;
    double mean=0, stdd=0;
    double sq_diff_sum = 0, sum=0;
    double diff = 0;
    Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
    output = (double *) malloc(nrows * ncols * sizeof(double));
    std::vector< std::vector<double> > V(huge_mat.n_rows);
    //Some UGLY copy from arma::mat to double* using a vector:
    for (size_t i = 0; i < huge_mat.n_rows; ++i)
        V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));
    //then dump to Mat_full array:
    for (i=0; i < V.size(); i++)
        for (j=0; j < V[i].size(); j++)
            Mat_full[i + huge_mat.n_rows * j] = V[i][j];
    t1 = high_resolution_clock::now();
    for(i=0; i < nrows; i++)
        for(j=0; j < ncols; j++)
        {
            //compute mean of subset-------------------
            sum = 0;
            for(k = 0; k < ncols; k++)
                if(k!=j)
                {
                    sum = sum + Mat_full[i+k*nrows];
                }
            mean = sum / (ncols-1);
            //compute standard deviation of subset-----
            sq_diff_sum = 0;
            for(k = 0; k < ncols; k++)
                if(k!=j)
                {
                    diff = Mat_full[i+k*nrows] - mean;
                    sq_diff_sum += diff * diff;
                }
            stdd = sqrt(sq_diff_sum / (ncols-2));
            //export to plain C array:
            output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
        }
    t2 = high_resolution_clock::now();
    duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "C ARRAY: " << duration << " secs\n";
    //release the buffers allocated with malloc:
    free(Mat_full);
    free(output);
}
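In particular, the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analysis of the effect of size on performance, but it seems that for small values of nrows plain C tends to be (much) faster. For a simple test using this setup I got the following execution times:
ARMADILLO: 111 secs
C ARRAY: 79 secs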
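As a side note on the cost of the inner loop: both versions above recompute the subset mean and standard deviation from scratch for every j, which is O(ncols) work per value. The comment in the code ("can compute the mean and std first... for each line") hints at an alternative. Below is a minimal sketch of that idea, assuming plain `std::vector<double>` rows and a hypothetical helper name `leave_one_out_scores` (not an Armadillo function). It uses the sum and sum of squares of the whole row, so each value costs O(1). The sum-of-squares shortcut can lose precision when the values are large relative to their spread, so treat it as an illustration rather than a drop-in replacement.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Armadillo): for each j, computes
// (x[j] - mean of the other values) / (sample std of the other values).
std::vector<double> leave_one_out_scores(const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(n >= 3);  // at least 2 values must remain for a sample std

    // One pass over the row: total sum and total sum of squares.
    double sum = 0.0, sum_sq = 0.0;
    for (double v : x) { sum += v; sum_sq += v * v; }

    const double m = static_cast<double>(n - 1);  // size of each subset
    std::vector<double> out(n);
    for (std::size_t j = 0; j < n; ++j) {
        const double s    = sum - x[j];            // subset sum
        const double ss   = sum_sq - x[j] * x[j];  // subset sum of squares
        const double mean = s / m;
        const double var  = (ss - m * mean * mean) / (m - 1.0);  // divisor n - 2
        out[j] = (x[j] - mean) / std::sqrt(var);
    }
    return out;
}
```

For the row {1, 2, 3, 4}, the first value is compared against {2, 3, 4} (mean 3, sample std 1), giving -2. The extra sqrt(1 + 1/(ncols-1)) factor from the code above is left out here and could be multiplied into the denominator.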
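EDIT
Here is the modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time is slightly lower (ARMADILLO: 104 secs now), thus showing some improvement over working row-wise: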
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
 * main()
 ***************************/
int main( int argc, char *argv[] )
{
    unsigned nrows = 100; //number of rows
    unsigned ncols = 2000000; //number of cols
    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
    arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
    arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    //compute some normalization over each value of line of this huge matrix:
    for(unsigned i=0; i < ncols; i++) {
        const arma::colvec current_line = huge_mat.col(i); //extract current line
        //for each observation in current_line:
        for(unsigned j=0; j < nrows; j++) {
            //Take care of side effects first:
            if( j == 0 )
                inds = current_line(arma::span(1, nrows-1));
            else
                if( j == 1 ) {
                    inds(0) = current_line(0);
                    inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
                } else
                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
            simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
        }
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "ARMADILLO: " << duration << " secs\n";
}
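The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, which requires contiguous blocks of memory.
To verify whether this is the cause, run the same computation over columns instead of rows and check the difference.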
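To make the layout argument concrete, here is a small sketch of the two index formulas. The function names are hypothetical and introduced only for illustration; the column-major formula is the one arma::mat uses for its storage. Walking along a row in a column-major buffer jumps n_rows elements per step, whereas in a row-major buffer consecutive row elements are adjacent in memory.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (r, c) in a column-major buffer with n_rows rows
// (hypothetical helper; this is the layout arma::mat uses).
inline std::size_t col_major_index(std::size_t r, std::size_t c, std::size_t n_rows) {
    assert(r < n_rows);
    return r + c * n_rows;
}

// Offset of element (r, c) in a row-major buffer with n_cols columns
// (hypothetical helper; the usual layout of a C 2-D array).
inline std::size_t row_major_index(std::size_t r, std::size_t c, std::size_t n_cols) {
    assert(c < n_cols);
    return r * n_cols + c;
}
```

With 2000000 rows, moving from column 0 to column 1 within the same row shifts the column-major offset by 2000000 doubles, versus 1 double in row-major. Note that the C loop in the question reads `Mat_full[i + k*nrows]`, which is the same column-major formula, so it may be worth checking the indexing of both versions before attributing the whole gap to memory layout.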