How to interpret "Value Loss" chart in TensorBoard?

I have a helicopter in Unity ML-Agents that seeks a target and avoids obstacles. Looking at the TensorBoard for my training run, I'm trying to understand how to interpret "Losses/Value Loss".

I've googled around quite a bit for articles on ML loss, such as this one, but I can't seem to get an intuitive sense of what this means for my little helicopter and what changes, if any, I should make. (The helicopter is rewarded for moving closer to the target and again for reaching it, and penalized for moving farther away or for colliding. It measures various things such as relative velocity, relative target position, ray sensors, etc. It basically works for target-seeking, while more complex maze-type obstacles have not yet been tested or trained. It uses 3 layers.) Thanks!

In reinforcement learning and specifically regarding actor/critic algorithms, value loss is the difference (or an average of many such differences) between the learning algorithm's expectation of a state's value and the empirically observed value of that state.
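
To make that "difference, or an average of many such differences" concrete, here is a minimal sketch in PyTorch, assuming the common mean-squared-error form of the value loss used by PPO-style trainers (the names and numbers are made up for illustration, not taken from ML-Agents):

```python
import torch

predicted_values = torch.tensor([1.2, 0.8, 2.5])   # the critic's expectation for each state
observed_returns = torch.tensor([1.0, 1.0, 2.0])   # the empirically observed values

# The value loss is the average of the (squared) differences between the two.
value_loss = torch.mean((predicted_values - observed_returns) ** 2)
```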

What is a state's value? A state's value is, in short, how much reward you can expect given that you start in that state. Immediate reward contributes fully to this amount. Rewards that may occur later, but not immediately, contribute less, with more distant occurrences contributing less and less. We call this reduction in contribution to value a "discount", or we say that these rewards are "discounted".
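
A minimal sketch of discounting, assuming a discount factor (often called gamma) of 0.99; the rewards here are made up:

```python
gamma = 0.99
rewards = [1.0, 0.0, 0.0, 5.0]   # rewards received at successive steps from some state

# Immediate reward counts fully; each later reward is multiplied by gamma once per step.
value = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(value)   # 1.0 + 0 + 0 + 0.99**3 * 5.0 ≈ 5.85
```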

The expected value is the critic part of the algorithm's prediction of that value. In the case of a critic implemented as a neural network, it's the output of the network when the state is given as input.
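
For example, a critic could look something like the sketch below; the layer sizes are hypothetical, not the ones ML-Agents actually uses:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(
    nn.Linear(32, 128),   # 32 = size of the observation (state) vector, hypothetical
    nn.ReLU(),
    nn.Linear(128, 1),    # single output: the predicted value of the state
)

state = torch.randn(1, 32)        # one observation
expected_value = critic(state)    # the critic's expectation of that state's value
```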

The empirically observed value is what you get when you add up the reward you actually received on leaving that state, plus the rewards (discounted by some amount) you received over the next several steps (we'll say that after these steps you ended up in state X), plus (perhaps, depending on the implementation) some discounted amount based on the value of state X.
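
A minimal sketch of that observed value, assuming an n-step return with bootstrapping from the critic's estimate of state X (gamma, the rewards, and the estimates are all made up):

```python
gamma = 0.99
rewards = [0.1, 0.0, 1.0]     # rewards actually received over n = 3 steps after the state
value_of_state_x = 2.0        # the critic's estimate of state X, reached after those steps

observed_value = sum((gamma ** t) * r for t, r in enumerate(rewards))
observed_value += (gamma ** len(rewards)) * value_of_state_x   # the bootstrap term

predicted_value = 1.5                                      # what the critic expected
squared_error = (predicted_value - observed_value) ** 2    # one term of the value loss
```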

In short, the smaller the value loss, the better the critic has become at predicting how much reward the agent is going to collect. This doesn't mean the agent gets better at playing: after all, one can be terrible at a game yet accurately predict that they will lose, and when they will lose, if they learn to choose actions that make them lose quickly!