MATLAB 高效直方图查找

MATLAB efficient histogram look up

我有一个大型 3 维矩阵(大约为 1000x1000x100),其中包含对应于归一化高分辨率直方图中的 bin 的值。第 3 个矩阵维度中的每个索引都有一个直方图(例如,示例维度有 100 个直方图)。

检查找到 2D 索引值的概率(即与归一化直方图中的 bin 关联的值)的最快方法是什么?

我现在的代码非常慢:

probs = zeros(rows, cols, dims);
for k = 1 : dims
    tmp = data(:,:,k);
    [h, centers] = hist(tmp, 1000);
    h = h / sum(h); % Normalize the histogram
    for r = 1 : rows
        for c = 1 : cols
            % Identify bin center closest to value
            [~, idx] = min(abs(centers - data(r, c, k)));
            probs(r,c,k) = h(idx);
        end
    end
end

For 循环通常(虽然不总是)比向量化代码效率低,嵌套 for 循环通常更差。我怎样才能用更少的循环来做到这一点,同时又不会 运行ning 内存不足?我尝试了一些 repmat 调用来矢量化整个过程,但是我的 MATLAB 会话因 1000x1000x1000x100 矩阵而崩溃。

注意:我只有 MATLAB 2014a,所以虽然解决方案使用新的 histogram() function are welcome, I'm still stuck with hist().

这是一个小规模的演示示例,应该 运行 可以复制:

rng(2); % Seed the RNG for repeatability
rows = 3;
cols = 3;
dims = 2;
data = repmat(1:3,3,1,2);
probs = zeros(rows, cols, dims);
for k = 1 : dims
    tmp = normrnd(0,1,1000,1);
    [h, centers] = hist(tmp);
    h = h / sum(h); % Normalize the histogram
    for r = 1 : rows
        for c = 1 : cols
            % Identify bin center closest to value
            [~, idx] = min(abs(centers - data(r, c, k)));
            probs(r,c,k) = h(idx);
        end
    end
end

当我运行上面的代码时,我得到了以下输出(这是合乎逻辑的,因为直方图是正常的高斯分布):

probs(:,:,1) =

0.1370    0.0570    0.0030
0.1370    0.0570    0.0030
0.1370    0.0570    0.0030


probs(:,:,2) =

0.1330    0.0450    0.0050
0.1330    0.0450    0.0050
0.1330    0.0450    0.0050

注意:我在下面的答案中找到了一个有效的解决方案。

我假设您有一个矩阵 centersAll(具有 dims 行),其中包含每个第三维索引的直方图中心,以及一个类似的矩阵 hAll(具有 dims 行)包含直方图值。

centersAll 重塑为第三和第四维,使用 bsxfun 计算差异,沿第四维最小化,并使用它索引 hAll:

[~, idx] = min(abs(bsxfun(@minus, data, reshape(centersAll,1,1,dims,[]))), [], 4);
hAllt = hAll.'; %'
probs2 = hAllt(bsxfun(@plus, idx, reshape(0:dims-1, 1,1,[])*size(hAll,2)));

检查:

%// Data
clear all
rng(2); % Seed the RNG for repeatability
rows = 3;
cols = 3;
dims = 2;
data = repmat(1:3,3,1,2);
for k = 1 : dims
    tmp = normrnd(0,1,1000,1);
    [h, centers] = hist(tmp);
    h = h / sum(h); % Normalize the histogram                   
    centersAll(k,:) = centers;
    hAll(k,:) = h;
end

%// With loops
probs = zeros(rows, cols, dims);
for k = 1 : dims
    for r = 1 : rows
        for c = 1 : cols
            % Identify bin center closest to value
            centers = centersAll(k,:);
            h = hAll(k,:);
            [~, idx] = min(abs(centers - data(r, c, k)));
            probs(r,c,k) = h(idx);
        end
    end
end

%// Without loops
[~, idx] = min(abs(bsxfun(@minus, data, reshape(centersAll,1,1,dims,[]))), [], 4);
hAllt = hAll.'; %'
probs2 = hAllt(bsxfun(@plus, idx, reshape(0:dims-1, 1,1,[])*size(hAll,2)));

%// Check
probs==probs2

给予

ans(:,:,1) =
     1     1     1
     1     1     1
     1     1     1
ans(:,:,2) =
     1     1     1
     1     1     1
     1     1     1

最佳解决方案:

未同时使用新的 histogram() function (i.e. all versions before R2014b), the best way to do this is to take advantage of both the hist() function and the histc() 函数。

还有两种情况需要考虑:

  1. 对数据进行分箱,然后在直方图中查找相同的数据
  2. 在由不同数据形成的直方图中查找 bins

第一种情况比较简单。 histc() 的一个很好的特性是它 returns 直方图和数据被分箱到的直方图的索引。至此,我们应该大功告成了。唉!可悲的是我们不是。因为 histc()hist() 背后的代码对数据进行不同的分箱,我们最终得到两个不同的直方图,具体取决于我们使用的是哪个。原因似乎是 histc() 根据 严格大于 标准选择 bin,而 hist() 使用 大于 - 选择 bin或等于。结果,等效函数调用:

% Using histc
binEdges = linspace(min(tmp),max(tmp),numBins+1);
[h1, indices] = histc(data, binEdges);

% Using hist
[h2, indices] = hist(data, numBins);

导致不同的直方图:length(h1) - length(h2) = 1.

因此,为了处理这个问题,我们只需将 h1 的最后一个 bin 中的值添加到 h1 的倒数第二个 bin 中的值,去掉最后一个 bin,然后相应地调整指数:

% Account for "strictly greater than" bug that results in an extra bin

h1(:, numBins) = h1(:, numBins) + h1(:, end); % Combine last two bins
indices(indices == numBins + 1) = numBins; % Adjust indices to point to right spot
h1 = h1(:, 1:end-1); % Lop off the extra bin

现在您剩下一个 h1 匹配 h2indices 向量对应于您的数据在 h1 中的位置。因此,您可以通过高效的索引而不是循环来查找概率信息。

可运行的示例代码:

rng(2); % Seed the RNG for repeatability

% Generate some data
numBins = 6;
data = repmat(rand(1,5), 3, 1, 2);
[rows, cols, dims] = size(data);
N = rows*cols;

% Bin all data into a histogram, keeping track of which bin each data point
% gets mapped to
h = zeros(dims, numBins + 1);
indices = zeros(dims, N);
for k = 1 : dims
    tmp = data(:,:,k);
    tmp = tmp(:)';
    binEdges = linspace(min(tmp),max(tmp),numBins+1);
    [h(k,:), indices(k,:)] = histc(tmp, binEdges);
end

% Account for "strictly greater than" bug that results in an extra bin
h(:, numBins) = h(:, numBins) + h(:, end); % Add count in last bin to the second-to-last bin
indices(indices == numBins + 1) = numBins; % Adjust indices accordingly
h = h(:,1:end-1); % Lop off the extra bin
h = h ./ repmat(sum(h,2), 1, numBins); % Normalize all histograms

% Now we can efficiently look up probabilities by indexing instead of
% looping
for k = 1 : dims
    probs(:, :, k) = reshape(h(sub2ind(size(h), repmat(k, 1, size(indices, 2)), indices(k,:))), rows, cols);
end
probs

在第二种情况下,查找更加困难,因为您没有在直方图创建期间跟踪 bin 索引的奢侈。但是,我们可以通过构建与第一个具有相同分箱的第二个直方图并在分箱过程中跟踪索引来解决此问题。

对于这种方法,您首先使用 hist() 在一些直方图训练数据上计算初始直方图。您需要存储的只是该训练数据的最小值和最大值。有了这些信息,我们可以使用 linspace()histc() 生成相同的直方图,调整 histc() 给出的 extra-bin "bug"。

这里的关键是处理异常数据。也就是说,新数据集中的数据落在预先计算的直方图之外。由于应该为它分配 frequency/probability 0,我们只需向预先计算的直方图中添加一个值为 0 的额外分箱,然后我们将任何未分​​箱的新数据映射到该索引。

这是第二种方法的一些注释的可运行代码:

% PRE-COMPUTE A HISTOGRAM
rng(2); % Seed the RNG for repeatability

% Build some data
numBins = 6;
old_data = repmat(rand(1,5), 3, 1, 2);
[rows, cols, dims] = size(old_data);

% Store min and max of each histogram for reconstruction process
min_val = min(old_data, [], 2);
max_val = max(old_data, [], 2);

% Just use hist() function while specifying number of bins this time
% No need to track indices because we are going to be using this histogram
% as a reference for looking up a different set of data
h = zeros(dims, numBins);
for k = 1 : dims
    tmp = old_data(:,:,k);
    tmp = tmp(:)';
    h(k,:) = hist(tmp, numBins);
end
h = h ./ repmat(sum(h, 2), 1, numBins); % Normalize histograms
h(:, end + 1) = 0; % Map to here any data to that falls outside the pre-computed histogram

% NEW DATA
rng(3); % Seed RNG again for repeatability

% Generate some new data
new_data = repmat(rand(1,4), 4, 1, 2); % NOTE: Doesn't have to be same size
[rows, cols, dims] = size(new_data);
N = rows*cols;


% Bin new data with histc() using boundaries from pre-computed histogram
h_new = zeros(dims, numBins + 1);
indices_new = zeros(dims, N);
for k = 1 : dims
    tmp = new_data(:,:,k);
    tmp = tmp(:)';

    % Determine bins for new histogram with the same boundaries as
    % pre-computed one. This ensures that our resulting histograms are
    % identical, except for the "greater-than" bug which is accounted for
    % below.
    binEdges = linspace(min_val(k), max_val(k), numBins+1);
    [h_new(k,:), indices_new(k,:)] = histc(tmp, binEdges);
end

% Adjust for the "greater-than" bug
% When adjusting this histogram, we are directing outliers that don't
% fit into the pre-computed histogram to look up probabilities from that 
% extra bin we added to the pre-computed histogram.
h_new(:, numBins) = h_new(:, numBins) + h_new(:, end); % Add count in last bin to the second-to-last bin
indices_new(indices_new == numBins + 1) = numBins; % Adjust indices accordingly
indices_new(indices_new == 0) = numBins + 1; % Direct any unbinned data to the 0-probability last bin
h_new = h_new ./ repmat(sum(h_new,2), 1, numBins + 1); % Normalize all histograms

% Now we should have all of the new data binned into a histogram
% that matches the pre-computed one. The catch is, we now have the indices
% of the bins the new data was matched to. Thus, we can now use the same
% efficient indexing-based look-up strategy as before to get probabilities
% from the pre-computed histogram.
for k = 1 : dims
    probs(:, :, k) = reshape(h(sub2ind(size(h), repmat(k, 1, size(indices_new, 2)), indices_new(k,:))), rows, cols);
end
probs