使用 GPU 高效计算频率平均周期图

Efficiently Calculate Frequency Averaged Periodogram Using GPU

在 Matlab 中,我正在寻找一种在 GPU 上最有效地计算频率平均周期图的方法。

我知道最重要的是尽量减少循环并使用已经内置的 GPU 函数。然而,我的代码仍然感觉相对未优化,我想知道我可以对其进行哪些更改以获得更好的速度。

r = 5;  % Dimension
n = 100;  % Time points
m = 20;  % Bandwidth of smoothing

% Generate some random rxn data 
X = rand(r, n);

% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);

% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
    Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end

% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);

% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and 
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);

% Perform filtering
for i = 1:r
    for j = 1:r
        temp = filter(w, [1], WrapPdgm(i,j,:));
        SmPdgm(i,j,:) = temp(:,:,m+1:end);
    end
end

特别是,在从傅立叶变换数据计算初始 Pdgm 时,我看不到优化 for 循环的方法,我觉得我在 WrapPdgm 中玩的把戏如果有一个平滑的函数,那么为了在 GPU 上利用 filter() 感觉没有必要。

要摆脱第一个循环,您可以执行以下操作:

Pdgm_cell = cellfun(@(x) x * x', mat2cell(FT(:, 1 : 51), [5], ones(51, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell),5,5,[]);

然后在您的过滤器中,您可以执行以下操作:

temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);

3 让过滤器知道沿着数据的第 3 个维度进行操作。

您可以在 GPU 上使用 pagefun 进行第一个循环。 (请注意,cellfun 的实现基本上是一个隐藏循环,而 pagefun 使用批处理 GEMM 操作在 GPU 上本地运行)。方法如下:

n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
    R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(@mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2

解决方案代码

这似乎非常有效,因为下一节中的基准运行时可能会让我们信服 -

%// Select the portion of FT to be processed and 
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));

%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(@times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));

%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);

%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));

基准测试

基准代码

%// Input parameters
r = 50; % Dimension
n = 1000;  % Time points
m = 200;  % Bandwidth of smoothing

% Generate some random rxn data
X = rand(r, n);

% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);

% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));

tic,    %// ... Code from original approach,   toc
tic     %// ... Code from proposed approach,   toc

因此在 GPU, GTX 750 Ti 上针对 CPU, I-7 4790K -

获得的运行时结果
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.