How to convert probability to angle degree in a head-pose estimation problem?

I'm reusing someone else's code for head-pose prediction in Euler angles. The authors trained a classification network that returns bin classification results for the three angles: yaw, roll, and pitch. The number of bins is 66. They somehow convert the probabilities to the corresponding angles, as written on lines 150 to 152 here. Can anyone explain the formula?

These are the relevant lines of code from the file mentioned above:

[56]  model = hopenet.Hopenet(torchvision.models.resnet.Bottleneck, [3, 4, 6, 3], 66) # a variant of ResNet50
[80]  idx_tensor = [idx for idx in xrange(66)]
[81]  idx_tensor = torch.FloatTensor(idx_tensor).cuda(gpu)
[144] yaw, pitch, roll = model(img)
[146] yaw_predicted = F.softmax(yaw)
[150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99

If we look at the training code, and the authors' paper,* we see that the loss function is the sum of two losses:

  1. the raw model output (one score per bin class), used for the classification loss:
[144] yaw, pitch, roll = model(img)
  2. a linear combination of the bin predictions (the predicted continuous angle), used for the regression loss:
[146] yaw_predicted = F.softmax(yaw)
[150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99

Since the expression sum(softmax(output) * idx) * 3 - 99 is the final layer of the regression loss during training (but not an explicit part of the model's forward pass), it must be applied to the raw output at inference time to turn the vector of bin probabilities into a single angle prediction. Concretely: the 66 bins cover the range [-99°, +99°) in 3° steps, so the softmax-weighted sum is the expected bin index, multiplying by 3 converts bin units to degrees, and subtracting 99 shifts the result so the range is centered on zero.
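To make the formula concrete, here is a minimal, self-contained sketch of the same computation. The random logits stand in for the model output; everything else mirrors lines [80], [81], [146], and [150] above:

```python
import torch
import torch.nn.functional as F

# Stand-in for the raw yaw output of the model for one image: 66 bin scores.
yaw_logits = torch.randn(1, 66)

# Bin indices 0..65. Each bin is 3 degrees wide and the bins span [-99, 99),
# so bin i corresponds roughly to (3 * i - 99) degrees.
idx_tensor = torch.arange(66, dtype=torch.float32)

# Softmax turns the raw scores into a probability distribution over bins.
probs = F.softmax(yaw_logits, dim=1)

# The probability-weighted sum of bin indices is the expected bin index.
expected_bin = torch.sum(probs[0] * idx_tensor)

# Convert the expected bin index to degrees: scale by the 3-degree bin width,
# then shift so the range is centered on zero.
yaw_degrees = expected_bin * 3 - 99
print(float(yaw_degrees))
```

Because the expectation is taken over all bins rather than using the argmax bin, the prediction is continuous, not quantized to 3° steps.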


*

3.2. The Multi-Loss Approach

All previous work which predicted head pose using convolutional networks regressed all three Euler angles directly using a mean squared error loss. We notice that this approach does not achieve the best results on our large-scale synthetic training data.

We propose to use three separate losses, one for each angle. Each loss is a combination of two components: a binned pose classification and a regression component. Any backbone network can be used and augmented with three fully-connected layers which predict the angles. These three fully-connected layers share the previous convolutional layers of the network.

The idea behind this approach is that by performing bin classification we use the very stable softmax layer and cross-entropy, thus the network learns to predict the neighbourhood of the pose in a robust fashion. By having three cross-entropy losses, one for each Euler angle, we have three signals which are backpropagated into the network which improves learning. In order to obtain fine-grained predictions we compute the expectation of each output angle for the binned output. The detailed architecture is shown in Figure 2.

We then add a regression loss to the network, namely a mean-squared error loss, in order to improve fine-grained predictions. We have three final losses, one for each angle, and each is a linear combination of both the respective classification and the regression losses. We vary the weight of the regression loss in Section 4.4 and we hold the weight of the classification loss constant at 1. The final loss for each Euler angle is the following:

L = H(y, ŷ) + α · MSE(y, ŷ)

where H and MSE respectively designate the cross-entropy and mean-squared error loss functions.
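The combined per-angle loss the paper describes could be sketched as below. This is my own reconstruction, not the authors' training code: the helper name `multi_loss`, the `alpha` parameter, and the bin-label computation are assumptions based on the 66-bin, 3-degree layout seen in the inference code.

```python
import torch
import torch.nn.functional as F

def multi_loss(logits, angle_deg, alpha=1.0):
    """Classification + regression loss for one Euler angle (sketch).

    logits:    (batch, 66) raw bin scores from one fully-connected head.
    angle_deg: (batch,) ground-truth angle in degrees, assumed in [-99, 99).
    alpha:     weight of the regression term; the classification weight is 1.
    """
    # Classification target: which of the 66 three-degree bins the angle
    # falls into, assuming the bins cover [-99, 99).
    bin_label = ((angle_deg + 99) / 3).floor().long().clamp(0, 65)
    cls_loss = F.cross_entropy(logits, bin_label)

    # Regression prediction: the expected angle under the binned
    # distribution, i.e. the same sum(softmax * idx) * 3 - 99 formula
    # used at inference time on line [150].
    idx = torch.arange(66, dtype=torch.float32, device=logits.device)
    expected_deg = torch.sum(F.softmax(logits, dim=1) * idx, dim=1) * 3 - 99
    reg_loss = F.mse_loss(expected_deg, angle_deg)

    # Linear combination: cross-entropy plus weighted mean-squared error.
    return cls_loss + alpha * reg_loss

# Hypothetical usage with a random batch of 4 samples:
loss = multi_loss(torch.randn(4, 66), torch.tensor([0.0, 30.0, -45.0, 90.0]))
```

In the full model this loss would be computed three times, once per head (yaw, pitch, roll), and the three losses backpropagated together.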