使用 GPU 高效计算频率平均周期图
Efficiently Calculate Frequency Averaged Periodogram Using GPU
在 Matlab 中,我正在寻找一种在 GPU 上最有效地计算频率平均周期图的方法。
我知道最重要的是尽量减少循环并使用已经内置的 GPU 函数。然而,我的代码仍然感觉相对未优化,我想知道我可以对其进行哪些更改以获得更好的速度。
r = 5; % Dimension
n = 100; % Time points
m = 20; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end
% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);
% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);
% Perform filtering
for i = 1:r
for j = 1:r
temp = filter(w, [1], WrapPdgm(i,j,:));
SmPdgm(i,j,:) = temp(:,:,m+1:end);
end
end
特别是,在从傅立叶变换数据计算初始 Pdgm
时,我看不到优化 for 循环的方法,我觉得我在 WrapPdgm
中玩的把戏如果有一个平滑的函数,那么为了在 GPU 上利用 filter()
感觉没有必要。
要摆脱第一个循环,您可以执行以下操作:
Pdgm_cell = cellfun(@(x) x * x', mat2cell(FT(:, 1 : 51), [5], ones(51, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell),5,5,[]);
然后在您的过滤器中,您可以执行以下操作:
temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);
3 让过滤器知道沿着数据的第 3 个维度进行操作。
您可以在 GPU 上使用 pagefun
进行第一个循环。 (请注意,cellfun
的实现基本上是一个隐藏循环,而 pagefun
使用批处理 GEMM 操作在 GPU 上本地运行)。方法如下:
n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(@mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2
解决方案代码
这似乎非常有效,因为下一节中的基准运行时可能会让我们信服 -
%// Select the portion of FT to be processed and
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));
%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(@times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));
%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);
%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));
基准测试
基准代码
%// Input parameters
r = 50; % Dimension
n = 1000; % Time points
m = 200; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
tic, %// ... Code from original approach, toc
tic %// ... Code from proposed approach, toc
因此在 GPU, GTX 750 Ti
上针对 CPU, I-7 4790K
-
获得的运行时结果
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.
在 Matlab 中,我正在寻找一种在 GPU 上最有效地计算频率平均周期图的方法。
我知道最重要的是尽量减少循环并使用已经内置的 GPU 函数。然而,我的代码仍然感觉相对未优化,我想知道我可以对其进行哪些更改以获得更好的速度。
r = 5; % Dimension
n = 100; % Time points
m = 20; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
Pdgm = zeros(r, r, n/2 + 1);
for j = 1:n/2 + 1
Pdgm(:,:,j) = FT(:,j)*FT(:,j)';
end
% Finally smooth with our weights
SmPdgm = zeros(r, r, n/2 + 1);
% Take advantage of the GPU filter function
% Create new Periodogram WrapPdgm with m/2 values wrapped around in front and
% behind it (it seems like there is redundancy here)
WrapPdgm = zeros(r,r,n/2 + 1 + m);
WrapPdgm(:,:,m/2+1:n/2+m/2+1) = Pdgm;
WrapPdgm(:,:,1:m/2) = flip(Pdgm(:,:,2:m/2+1),3);
WrapPdgm(:,:,n/2+m/2+2:end) = flip(Pdgm(:,:,n/2-m/2+1:end-1),3);
% Perform filtering
for i = 1:r
for j = 1:r
temp = filter(w, [1], WrapPdgm(i,j,:));
SmPdgm(i,j,:) = temp(:,:,m+1:end);
end
end
特别是,在从傅立叶变换数据计算初始 Pdgm
时,我看不到优化 for 循环的方法,我觉得我在 WrapPdgm
中玩的把戏如果有一个平滑的函数,那么为了在 GPU 上利用 filter()
感觉没有必要。
要摆脱第一个循环,您可以执行以下操作:
Pdgm_cell = cellfun(@(x) x * x', mat2cell(FT(:, 1 : 51), [5], ones(51, 1)), 'UniformOutput', false);
Pdgm = reshape(cell2mat(Pdgm_cell),5,5,[]);
然后在您的过滤器中,您可以执行以下操作:
temp = filter(w, 1, WrapPdgm, [], 3);
SmPdgm = temp(:, :, m + 1 : end);
3 让过滤器知道沿着数据的第 3 个维度进行操作。
您可以在 GPU 上使用 pagefun
进行第一个循环。 (请注意,cellfun
的实现基本上是一个隐藏循环,而 pagefun
使用批处理 GEMM 操作在 GPU 上本地运行)。方法如下:
n = 16;
r = 8;
X = gpuArray.rand(r, n);
R = gpuArray.zeros(r, r, n/2 + 1);
for jj = 1:(n/2+1)
R(:,:,jj) = X(:,jj) * X(:,jj)';
end
X2 = X(:,1:(n/2+1));
R2 = pagefun(@mtimes, reshape(X2, r, 1, []), reshape(X2, 1, r, []));
R - R2
解决方案代码
这似乎非常有效,因为下一节中的基准运行时可能会让我们信服 -
%// Select the portion of FT to be processed and
%// send copy to GPU for calculating everything
gFT = gpuArray(FT(:,1:n/2 + 1));
%// Perform non-smoothed Periodogram, thus removing the first loop
Pdgm1 = bsxfun(@times,permute(gFT,[1 3 2]),permute(conj(gFT),[3 1 2]));
%// Generate WrapPdgm right on GPU
WrapPdgm1 = zeros(r,r,n/2 + 1 + m,'gpuArray');
WrapPdgm1(:,:,m/2+1:n/2+m/2+1) = Pdgm1;
WrapPdgm1(:,:,1:m/2) = Pdgm1(:,:,m/2+1:-1:2);
WrapPdgm1(:,:,n/2+m/2+2:end) = Pdgm1(:,:,end-1:-1:n/2-m/2+1);
%// Perform filtering on GPU and get the final output, SmPdgm1
filt_data = filter(w,1,reshape(WrapPdgm1,r*r,[]),[],2);
SmPdgm1 = gather(reshape(filt_data(:,m+1:end),r,r,[]));
基准测试
基准代码
%// Input parameters
r = 50; % Dimension
n = 1000; % Time points
m = 200; % Bandwidth of smoothing
% Generate some random rxn data
X = rand(r, n);
% Generate normalised weights according to a cos window
w = cos(pi * (-m/2:m/2)/m);
w = w/sum(w);
% Generate non-smoothed Periodogram
FT = (n)^(-0.5)*(ctranspose(fft(ctranspose(X))));
tic, %// ... Code from original approach, toc
tic %// ... Code from proposed approach, toc
因此在 GPU, GTX 750 Ti
上针对 CPU, I-7 4790K
-
------------------------------ With Original Approach on CPU
Elapsed time is 0.279816 seconds.
------------------------------ With Proposed Approach on GPU
Elapsed time is 0.169969 seconds.