How to understand "calculate the priors based on occurence in the training set" in a function

Question

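I have a function from a toolbox, which I have pasted below. I don't understand the last section, starting from "% // calculate the priors based on occurence in the training set". Could someone explain it to me? Thanks a lot!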

function [scratch] = train_gnb(trainpats,traintargs, in_args, cv_args)

% // Use a Gaussian Naive Bayes classifier to learn regressors.
%
% // [SCRATCH] = TRAIN_GNB(TRAINPATS, TRAINTARGS, IN_ARGS, CV_ARGS)
%
% // The Gaussian Naive Bayes classifier makes the assumption that
% // each data point is conditionally independent of the others, given
% // a class label, and that, furthermore, the likelihood function for
% // each class is normal.  The likelihood of a given data point X,
% // where Y is one of K labels, is thus:
%
% // Pr ( X | Y==K) = Product_N ( Normal(X_N | theta_K) ) 
% 
% // The GNB is trained by finding the Normal MLE's for each subset of
% // the training set that have the same label.  Each voxel has a
% // scalar mean and a scalar variance.
%
% // OPTIONAL ARGUMENTS:
%
% // UNIFORM_PRIOR (default = true): If uniform_prior is true,
% // then the algorithm will assume that no classes are
% // inherently more likely than others, and will use 1/K as
% // the prior probability for each of K classes.  If
% // uniform_prior is false, then train_gnb will estimate the
% // priors from the data using laplace smoothing: if N_k is
% // the number of times class k is observed in the training
% // set and N is the total number of training datapoints, then
% // Pr(Y == k) = (N_k + 1) / (N + K).  This way, no cluster is
% // ever assigned a 0 prior.

% // License:
% // =====================================================================
%
% // This is part of the Princeton MVPA toolbox, released under
% // the GPL. See http://www.csbmb.princeton.edu/mvpa for more
% // information.
% 
%  // The Princeton MVPA toolbox is available free and
% // unsupported to those who might find it useful. We do not
% // take any responsibility whatsoever for any problems that
% // you have related to the use of the MVPA toolbox.
%
% // ======================================================================

defaults.uniform_prior = true;

args = mergestructs(in_args, defaults);

nConds = size(traintargs,1);
[nVox nTimepoints] = size(trainpats);

% // find a gaussian distribution for each voxel for each category

scratch.mu = NaN(nVox, nConds);
scratch.sigma = NaN(nVox, nConds);

for k = 1:nConds

  % // grab the subset of the data with a label of category k
    k_idx = find(traintargs(k, :) == 1);

    if numel(k_idx) < 1
      error('Condition %g has no data points.', k);
    end

    data = trainpats(:, k_idx);

    % estimate each voxel's mean and standard deviation (normfit returns sigma, not the variance)
    [ mu_hat, sigma_hat] = normfit(data');

    scratch.mu(:,k) = mu_hat;
    scratch.sigma(:,k) = sigma_hat;

end

% // calculate the priors based on occurence in the training set
scratch.prior = NaN(nConds, 1);
if (args.uniform_prior)
  scratch.prior = ones(nConds,1) / nConds;
else

  for k = 1:nConds  
    scratch.prior(k) = (1 + numel( find(traintargs(k, :) == 1))) / ...
        (nConds + nTimepoints);    
  end

end

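Answer 1

"Prior" is short for "prior distribution": the distribution that describes how likely each class is before you look at a particular data point. It matters when you have to look at a new data point and decide, based on your training data, which class it belongs to. If you know a priori that one class is more likely to occur than another, that should influence the decision about which class the new point belongs to.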

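A common assumption for the prior distribution is a "uniform prior". This means that when you go to test a new data point, you assume each class is just as likely as any other class. A uniform prior is a reasonable default, but it may not model your data very well.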

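A potentially better model is to assume that your training data is representative of all of your data. You then measure how often each class occurs in the training data, and that becomes your prior.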
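To see why the choice of prior matters, here is a minimal sketch with two classes and a single feature. All numbers are made up, and the names pick_uniform and pick_skewed are hypothetical. The scoring rule (log-likelihood plus log-prior) is the usual naive Bayes decision rule, which I am assuming here, since the training function above does not show how the prior is used at test time.

```matlab
% Hypothetical 1-D example: two classes with Gaussian likelihoods.
mu    = [0 2];      % class means (made-up values)
sigma = [1 1];      % class standard deviations (made-up values)
x     = 1.2;        % new data point, slightly closer to the mean of class 2

% Gaussian log-likelihood of x under each class
loglik = -0.5*log(2*pi*sigma.^2) - (x - mu).^2 ./ (2*sigma.^2);

% Uniform prior: the decision follows the likelihood alone
prior_uniform = [0.5 0.5];
[~, pick_uniform] = max(loglik + log(prior_uniform));

% Skewed prior: class 1 is assumed to be far more common than class 2
prior_skewed = [0.9 0.1];
[~, pick_skewed] = max(loglik + log(prior_skewed));

fprintf('uniform prior picks class %d, skewed prior picks class %d\n', ...
        pick_uniform, pick_skewed);
```

With the uniform prior the point goes to class 2, the class whose mean is closer. With the skewed prior the same point goes to class 1, because the prior outweighs the small difference in likelihood.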

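So, back to your example code: your question is about the part that defines the prior. That part is described in the block comment at the top of the code. See this section: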

% UNIFORM_PRIOR (default = true): If uniform_prior is true,
% then the algorithm will assume that no classes are
% inherently more likely than others, and will use 1/K as
% the prior probability for each of K classes.  If
% uniform_prior is false, then train_gnb will estimate the
% priors from the data using laplace smoothing: if N_k is
% the number of times class k is observed in the training
% set and N is the total number of training datapoints, then
% Pr(Y == k) = (N_k + 1) / (N + K).  This way, no cluster is
% ever assigned a 0 prior.

In the code itself, you see the initial if (args.uniform_prior), which determines whether you are assuming a uniform prior.

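If you are assuming a uniform prior, the line scratch.prior = ones(nConds,1) / nConds; sets the prior to the same value for every class, i.e. a uniform distribution. The number of classes is given by nConds, so the prior probability that a new data point belongs to any one class is 1 / nConds.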

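If you are not assuming a uniform prior, the for loop goes through your training targets and counts the occurrences of each class, via the numel( find(traintargs(k, :) == 1)) part of the line. The rest of that line normalizes and smooths the count using the Laplace smoothing described in the top block comment: adding 1 to the count and dividing by (nConds + nTimepoints) is exactly (N_k + 1) / (N + K).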
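Both branches can be tried on a small, made-up target matrix. This is only a sketch: the labels below are invented, and the vectorized sum stands in for the toolbox's for loop (it should give the same counts, assuming traintargs is a one-hot matrix with one row per class and one column per timepoint, as the indexing in the function suggests).

```matlab
% Hypothetical one-hot targets: 3 classes (rows), 10 timepoints (columns)
labels      = [1 1 1 1 1 1 2 2 2 3];     % made-up class label per timepoint
nConds      = 3;
nTimepoints = numel(labels);
traintargs  = zeros(nConds, nTimepoints);
traintargs(sub2ind(size(traintargs), labels, 1:nTimepoints)) = 1;

% Uniform prior: every class gets 1/K
prior_uniform = ones(nConds, 1) / nConds;

% Laplace-smoothed prior: (N_k + 1) / (N + K)
N_k            = sum(traintargs == 1, 2);               % counts per class: [6; 3; 1]
prior_smoothed = (N_k + 1) / (nTimepoints + nConds);    % [7; 4; 2] / 13

disp([prior_uniform prior_smoothed]);
```

The smoothed priors still sum to 1, because the K extra counts in the numerators (one per class) match the K added to the denominator.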

Hope that helps!

Chip

matlab

machine-learning

bayesian