使用辛普森规则加速集成二维 gpuArray 矩阵的代码

Question

我有一个（真实的）2D gpuArray，我将其用作更大代码的一部分，现在我也在尝试在我的主循环中使用 Composite Simpson Rule 来集成数组（数次 10000 次迭代至少）。 MWE 如下所示：

%%%%%%%%%%%%%%%%%% MAIN CODE %%%%%%%%%%%%%%%%%%
Ny = 501;    % Dimensions of matrix M
Nx = 503;    %
dx = 0.1;    % Grid spacings
dy = 0.2;    %

M = rand(Ny, Nx, 'gpuArray'); % Initialise a  matrix

for k = 1:10000
    % M = function1(M)  % Apply some other functions to M
    % ... etc ...
    I = simpsons_integration_2D(M, dx, dy, Nx, Ny); % Now integrate M
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%% Integrator %%%%%%%%%%%%%%%%%
function I = simpsons_integration_2D(F, dx, dy, Nx, Ny)
% Integrate the 2D function F with Nx columns and Ny rows, and grid spacings
% dx and dy using Simpson's rule.

% Integrate along x direction (vertically) --> IX is a vector afterwards
sX = sum( F(:,1:2:Nx-2) + 4*F(:,2:2:(Nx-1)) + F(:,3:2:Nx) , 2);
IX = dx/3 * sX;

% Integrate along y direction --> I is a scalar afterwards
sY = sum( IX(1:2:Ny-2) + 4*IX(2:2:(Ny-1)) + IX(3:2:Ny) , 1);
I = dy/3 * sY;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

执行集成的操作大约需要 850 µs，这是我目前代码的重要部分。这是使用

测量的

f = @() simpsons_integration_2D(M, dx, dy, Nx, Ny);
t = gputimeit(f)

有没有办法减少集成gpuArray矩阵的执行时间？（显卡为Nvidia Quadro P4000）

非常感谢

Answer 1

这里假设矩阵有奇数维度是优化函数的一种方法：

function I = simpsons_integration_2D(F, dx, dy, Nx, Ny)
    sX = 2 * sum(F,2) + 2 * sum (F(:,2:2:(Nx-1)),2) -  F(:,1) - F(:,Nx);
    sY = dx/3 * (2 * sum(sX) + 2 * sum (sX(2:2:(Ny-1))) - sX(1) - sX(Ny));
    I = dy/3 * sY;
end

编辑

使用矩阵乘法的更优化的解决方案：

function I = simpsons_integration_2D2(F, dx, dy, Nx, Ny)
  mx = repmat (2,  Nx, 1);
  mx(2:2:(Nx-1)) = 4;
  mx(1) = 1;
  mx(Nx) = 1;
  my = repmat (2,  1, Ny);
  my(2:2:(Ny-1)) = 4;
  my(1) = 1;
  my(Ny) = 1;
  I = (dx*dy/9) * (my * (F * mx));
end

如果 Nx 和 Ny 相同，您只需计算其中一个 mx 或 my:

function I = simpsons_integration_2D2(F, dx, dy, Nx, Ny)
  mx = repmat (2,  Nx, 1);
  mx(2:2:(Nx-1)) = 4;
  mx(1) = 1;
  mx(Nx) = 1;
  I = (dx*dy/9) * (mx.' * (F * mx));
end

如果 Nx 和 Ny 是常量，您可以在函数外预先计算 mx 并将其作为函数参数传递：

function I = simpsons_integration_2D2(F, dx, dy, mx)
  I = (dx*dy/9) * (mx.' * (F * mx));
end

编辑：

如果 mx 和 my 都可以预先计算，则问题将简化为点积：

m = reshape (my.' .* mx.', 1, []);

function I = simpsons_integration_2D3(F, dx, dy, m)
  I = (dx*dy/9) * (m * F(:));
end

Answer 2

好吧，我无法为您测试这个，但有一些事情可能会有所帮助。

首先是轴 1，然后是轴 2 可能会在修改项的局部性方面有所不同（我不知道是好是坏）。

function I = variation1(F, dx, dy, Nx, Ny)
% Sum each term separately, prevents the creation of a big intermediate matrix
% Multiply outside the summation does only Ny multiplications by 4 instead of Ny*Nx/2
sX = sum(F(:,1:2:Nx-2), 2) + 4*sum(F(:,2:2:(Nx-1)), 2) + sum(F(:,3:2:Nx), 2);
IX = dx/3 * sX;
sY = sum(IX(1:2:Ny-2), 1) + 4*sum(IX(2:2:(Ny-1)), 1) + sum(IX(3:2:Ny) , 1);
I = dy/3 * sY;
end

function I = variation2(F, dx, dy, Nx, Ny)
% a.
% Sum each term separately, prevents the creation of a big intermediate matrix
% Multiply outside the summation does only Ny multiplications by 4 instead of Ny*Nx/2
% b.
% Notice that the terms 2:3:NX-2 appear in two summations
% Saves Nx*Ny/2 additions at the expense of Ny multiplications by 2
sX = 2*sum(F(:,3:2:Nx-2), 2) + 4*sum(F(:,2:2:(Nx-1)), 2) + F(:,1) + F(:,Nx);
% saves Ny multiplications by moving the constant factor after the next sum
sY = 2*sum(sX(3:2:Ny-2), 1) + 4*sum(sX(2:2:(Ny-1)), 1) + sX(1) + sX(Ny);
I = (dy*dy/9) * sY;
end

function I = alternate_simpsons_integration_2D(F, dx, dy, Nx, Ny)
% Integrate the 2D function F with Nx columns and Ny rows, and grid spacings
% dx and dy using Simpson's rule.

% Notice that sum(F(:,1:2:Nx-2) + F(:,3:2:Nx)) have all but the end poitns repeated.
IX = 4*sum(F(:,2:2:Nx-1), 2) + 2 * sum(F(:,3:2:Nx-2) , 2) + F(:,1) + F(:,Nx);
disp(size(IX))
% Integrate along y direction --> I is a scalar afterwards
sY = 4*sum(IX(2:2:Ny-1)) + 2*sum(IX(3:2:Ny-2)) + IX(1) + IX(Ny);
I = dy*dy/9 * sY;
end

如果您认为进行单个求和更好，那么您可以使用公式 2*(sum(2*F(2:2:end-1) + F(1:2:end-2)) + F(end) - F(1) 来得出相同的结果，但在第一次积分时添加的次数较少 Nx*Ny/2。但是这些选项必须在您的环境中进行测试。

转置实施

function I = transposed_simpsons_integration_2D(F, dx, dy, Nx, Ny)
sY = 2*sum(2*F(2:2:end-1, :) + F(1:2:end-2, :), 1) + F(end, :) - F(1, :);
sX = 2*sum(2*sY(2:2:end-1) + sY(1:2:end-2)) + sY(end) - sY(1);
I = dy*dy/9 * sX;
end

使用八度音程（通常比 matlab 慢）我每次迭代得到的运行时间约为 400us。这不是 GPU 上运行感兴趣的工作负载类型。作为比较，randn 比这个函数慢了大约 10 倍。

使用辛普森规则加速集成二维 gpuArray 矩阵的代码

Speeding up code which integrates a 2D gpuArray matrix using Simpson's Rule

optimization

matlab

gpu

vectorization

转置实施