如何在存在 for 循环和矩阵向量乘法的情况下减少计算时间
How to reduce the time of computation in presence of for loops and matrix-vector multiplication
我正在编写一个程序,其中计算时间非常重要,因此我必须以减少时间的方式编写代码。在下面,我写了一个代码,但是如果我的向量长度变大,它会很耗时。无论如何,有没有更快的方式产生相同的结果?
K1 = [1 2 3 4 5]; K2 = [6 7 8 9 10];
kt1 = [1.5 3 4.5]; kt2 = [6.5 8 9.5];
numk1 = bsxfun(@minus,K1.',kt1);
denomk1 = bsxfun(@minus, kt1.',kt1)+eye(numel(kt1));
numk2 = bsxfun(@minus,K2.',kt2);
denomk2 = bsxfun(@minus, kt2.', kt2)+eye(numel(kt2));
for j=1:numel(kt1)
for jj=1:numel(kt2)
k1_dir = bsxfun(@rdivide,numk1,denomk1(j,:)); k1_dir(:,j)=[];
k_dir1 = prod(k1_dir,2);
k2_dir = bsxfun(@rdivide,numk2,denomk2(jj,:)); k2_dir(:,jj)=[];
k_dir2 = prod(k2_dir,2);
k1_k2(:,:,j,jj) = k_dir1 * k_dir2';
end
end
在上面的代码中,随着K1
和K2
的长度增加,kt1
和kt2
的长度也会增加。因此,对于长向量长度,此代码非常耗时。
在 vectorization
上全力以赴,这可能是替换问题中列出的代码的循环部分的一种方法 -
%// Size parameters
M1 = numel(K1);
N1 = numel(kt1);
M2 = numel(K2);
N2 = numel(kt2);
%// Indices to be removed from k1_dir & k2_dir.
%// In our vectorized version, we will just use these to set
%// corresponding elements in vectorized versions of k1_dir & k2_dir
%// to ONES, as later on PROD would take care of it.
rm_idx1 = bsxfun(@plus,[1:M1]',[0:N1-1]*(M1*N1+M1)); %//'
rm_idx2 = bsxfun(@plus,[1:M2]',[0:N2-1]*(M2*N2+M2)); %//'
%// Get vectorized version of k1_dir, as k1_dirv
k1_dirv = bsxfun(@rdivide,numk1,permute(denomk1,[3 2 1]));
k1_dirv(rm_idx1) = 1;
k_dir1v = prod(k1_dirv,2);
%// Get vectorized version of k2_dir, as k2_dirv
k2_dirv = bsxfun(@rdivide,numk2,permute(denomk2,[3 2 1]))
k2_dirv(rm_idx2) = 1;
k_dir2v = prod(k2_dirv,2);
%// Get vectorized version of k1_k2, as k1_k2v
k1_k2v = bsxfun(@times,k_dir1v,permute(k_dir2v,[2 1 4 3]));
快速运行时测试:
像这样设置输入 -
SZ1 = 100;
SZ2 = 100;
K1 = randi(9,1,SZ1);
K2 = randi(9,1,SZ1);
kt1 = randi(9,1,SZ2);
kt2 = randi(9,1,SZ2);
原始循环部分的运行时间(在添加带有零的预分配代码以实现更公平的基准测试之后)和建议的矢量化方法是 -
-------------------------- With Original Loopy Approach
Elapsed time is 1.086666 seconds.
-------------------------- With Proposed Vectorized Approach
Elapsed time is 0.178805 seconds.
似乎 JIT
没有显示它的魔力,至少当 bsxfun
在嵌套循环中使用时没有,而且你需要在每次迭代中索引到那个巨大的 4D 数组对你没有帮助。因此,在这种情况下全速进行矢量化更有意义!
我正在编写一个程序,其中计算时间非常重要,因此我必须以减少时间的方式编写代码。在下面,我写了一个代码,但是如果我的向量长度变大,它会很耗时。无论如何,有没有更快的方式产生相同的结果?
K1 = [1 2 3 4 5]; K2 = [6 7 8 9 10];
kt1 = [1.5 3 4.5]; kt2 = [6.5 8 9.5];
numk1 = bsxfun(@minus,K1.',kt1);
denomk1 = bsxfun(@minus, kt1.',kt1)+eye(numel(kt1));
numk2 = bsxfun(@minus,K2.',kt2);
denomk2 = bsxfun(@minus, kt2.', kt2)+eye(numel(kt2));
for j=1:numel(kt1)
for jj=1:numel(kt2)
k1_dir = bsxfun(@rdivide,numk1,denomk1(j,:)); k1_dir(:,j)=[];
k_dir1 = prod(k1_dir,2);
k2_dir = bsxfun(@rdivide,numk2,denomk2(jj,:)); k2_dir(:,jj)=[];
k_dir2 = prod(k2_dir,2);
k1_k2(:,:,j,jj) = k_dir1 * k_dir2';
end
end
在上面的代码中,随着K1
和K2
的长度增加,kt1
和kt2
的长度也会增加。因此,对于长向量长度,此代码非常耗时。
在 vectorization
上全力以赴,这可能是替换问题中列出的代码的循环部分的一种方法 -
%// Size parameters
M1 = numel(K1);
N1 = numel(kt1);
M2 = numel(K2);
N2 = numel(kt2);
%// Indices to be removed from k1_dir & k2_dir.
%// In our vectorized version, we will just use these to set
%// corresponding elements in vectorized versions of k1_dir & k2_dir
%// to ONES, as later on PROD would take care of it.
rm_idx1 = bsxfun(@plus,[1:M1]',[0:N1-1]*(M1*N1+M1)); %//'
rm_idx2 = bsxfun(@plus,[1:M2]',[0:N2-1]*(M2*N2+M2)); %//'
%// Get vectorized version of k1_dir, as k1_dirv
k1_dirv = bsxfun(@rdivide,numk1,permute(denomk1,[3 2 1]));
k1_dirv(rm_idx1) = 1;
k_dir1v = prod(k1_dirv,2);
%// Get vectorized version of k2_dir, as k2_dirv
k2_dirv = bsxfun(@rdivide,numk2,permute(denomk2,[3 2 1]))
k2_dirv(rm_idx2) = 1;
k_dir2v = prod(k2_dirv,2);
%// Get vectorized version of k1_k2, as k1_k2v
k1_k2v = bsxfun(@times,k_dir1v,permute(k_dir2v,[2 1 4 3]));
快速运行时测试:
像这样设置输入 -
SZ1 = 100;
SZ2 = 100;
K1 = randi(9,1,SZ1);
K2 = randi(9,1,SZ1);
kt1 = randi(9,1,SZ2);
kt2 = randi(9,1,SZ2);
原始循环部分的运行时间(在添加带有零的预分配代码以实现更公平的基准测试之后)和建议的矢量化方法是 -
-------------------------- With Original Loopy Approach
Elapsed time is 1.086666 seconds.
-------------------------- With Proposed Vectorized Approach
Elapsed time is 0.178805 seconds.
似乎 JIT
没有显示它的魔力,至少当 bsxfun
在嵌套循环中使用时没有,而且你需要在每次迭代中索引到那个巨大的 4D 数组对你没有帮助。因此,在这种情况下全速进行矢量化更有意义!