A2C algorithm in tf.keras: actor loss function

Question

I am learning Actor-Critic reinforcement learning techniques, in particular the A2C algorithm.

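I found a good description of a simple version of the algorithm (i.e. without experience replay, batching, or other tricks), together with an implementation, here: https://link.medium.com/yi55uKWwV2. The complete code from that article is available on GitHub.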

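I think I roughly understand what is happening there, but to make sure I really do, I am trying to reimplement it from scratch using the higher-level tf.keras API. Where I am stuck is how to implement the training loop correctly and how to formulate the actor's loss function.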

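What is the proper way to pass actions and advantages into a loss function?
The actor's loss function involves computing the probability of an action under a given normal distribution. How can I make sure that the mu and sigma of the normal distribution during loss computation actually match the ones used during prediction?
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems wrong, but I am not sure how.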

My current code: https://gist.github.com/nevkontakte/beb59f29e0a8152d99003852887e7de7

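Edit: I think some of my confusion stems from a poor understanding of the magic behind gradient computation in Keras/TensorFlow, so any pointers on that would be appreciated.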

Answer 1

As far as I understand, A2C is a machine-learning implementation of an activator-inhibitor system, also known as a two-component reaction-diffusion system (https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_system). Activator-inhibitor models are important across the sciences because they describe pattern formation, for example the Turing mechanism (search the web for "activator-inhibitor model" and you will find a vast amount of information; predator-prey models are a very common application). Also cf. the graphic at https://www.researchgate.net/figure/Activator-Inhibitor-System_fig1_23671770/

Compare it with the diagram of the A2C algorithm in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69

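Activator-inhibitor models are closely related to the theory of nonlinear dynamical systems (or 'chaos theory'). This also becomes apparent when comparing the bifurcation-tree-like structure in https://medium.com/@asteinbach/rl-introduction-simple-actor-critic-for-continuous-actions-4e22afb712 with the bifurcation tree of a nonlinear dynamical system such as the logistic map (https://en.wikipedia.org/wiki/Logistic_map; the logistic map is one of the simplest predator-prey or activator-inhibitor models). Another similarity is the sensitivity to initial conditions in the A2C model, described as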

This introduces in inherent high variability in log probabilities (log of the policy distribution) and cumulative reward values, because each trajectories during training can deviate from each other at great degrees.

in https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f. The curse of dimensionality also appears in chaos theory, namely in attractor reconstruction.

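From a systems-theory point of view, the A2C algorithm tries to adapt the initial value (start state) so that the system ends at a given end point while the growth rate of the dynamical system, i.e. the logistic map, is increased (the r-value increases, and the initial value (start state) is constantly re-adapted so that the right bifurcations (actions) are chosen in the bifurcation tree).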

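So A2C tries to numerically solve a chaos-theory problem, namely finding the initial value that leads to a given outcome of a nonlinear dynamical system in its chaotic region. Analytically, this problem is unsolvable in most cases.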

The actions are the bifurcation points in the bifurcation tree, and the states are the future bifurcations.

Actions and states are each modeled by one of two coupled neural networks, and this coupling of the two networks is the A2C algorithm.

https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 contains well-documented Keras code implementing A2C, so you can take the implementation from there.

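The loss function there is defined as a temporal-difference (TD) function, which is exactly the difference between the state at the actual bifurcation point and the state at one of the estimated future ones. This mathematically exact definition is prone to stochastic error (or noise), so the stochastic error is included in the definition of "exact": in the end, machine learning is based on stochastic systems or error calculus, meaning systems composed of a deterministic and a stochastic component. To drive this error to zero, stochastic gradient descent is used. In Keras this is done by choosing optimizer='sgd'.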
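
For reference, the TD error mentioned above is usually written as the one-step quantity r + gamma * V(s') - V(s). The following is only a minimal framework-free sketch of that formula; the function name `td_error`, its parameters, and the numbers are hypothetical and not taken from the linked article.

```python
def td_error(reward, value_state, value_next_state, done, gamma=0.99):
    """One-step TD error: r + gamma * V(s') - V(s).

    When the episode has ended there is no next state to bootstrap from,
    so the gamma * V(s') term is dropped.
    """
    bootstrap = 0.0 if done else gamma * value_next_state
    return reward + bootstrap - value_state


# Hypothetical critic estimates for one transition.
delta_running = td_error(reward=1.0, value_state=2.0,
                         value_next_state=3.0, done=False, gamma=0.9)
delta_terminal = td_error(reward=1.0, value_state=2.0,
                          value_next_state=3.0, done=True, gamma=0.9)
print(delta_running, delta_terminal)
```

A positive value means the transition turned out better than the critic expected, a negative value means it turned out worse.
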
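The interaction between the actual step and the future step is implemented in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 as memory in the function remember, and this function also links the actor and critic networks (or activator and inhibitor networks). This general structure of trial (action), calling predict (TD function), remember, and train (i.e. stochastic gradient descent) is fundamental to all reinforcement learning algorithms and goes along with the structure actual state, action, reward, new state: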

The prediction code is also very much the same as it was in previous reinforcement learning algorithms. That is, we just have to iterate through the trial and call predict, remember, and train on the agent:

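Your first question is solved in that implementation by applying remember to the critic and training the critic with these values (this happens in the main function). Training always evaluates the loss function, so in this implementation action and reward are passed to the loss function via remember: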

actor_critic.remember(cur_state, action, reward, new_state, done)
actor_critic.train()

As for your second question: I am not sure, but I think this is achieved by the optimization algorithm (i.e. stochastic gradient descent).

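Third question: in a predator-prey model the actor or activator is the prey, and the prey's behavior is determined only by the size or capacity of the habitat (the amount of grass) and the size of the predator (inhibitor) population, so modeling it this way is again consistent with nature and with activator-inhibitor systems. In the main function in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69, too, only the critic (or inhibitor / predator) is trained.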

Answer 2

First of all, credit where it's due: the information provided by ralf htp and Simon helped me figure out the right answer in the end.

Before I answer my own questions in detail, here's the original code I was trying to rewrite in tf.keras terms, and here's my result.

What is the proper way to pass actions and advantages to a loss function in Keras?

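Raw TF optimizers and Keras treat loss functions differently. When you use an optimizer directly, it simply expects a tensor (lazy or eager, depending on your configuration), which is evaluated under tf.GradientTape() to compute gradients and update the weights.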

Example from https://medium.com/@asteinbach/actor-critic-using-deep-rl-continuous-mountain-car-in-tensorflow-4c1fb2110f7c:

# Below norm_dist is the output tensor of the neural network we are training.
loss_actor = -tfc.log(norm_dist.prob(action_placeholder) + 1e-5) * delta_placeholder
training_op_actor = tfc.train.AdamOptimizer(
    lr_actor, name='actor_optimizer').minimize(loss_actor)

# Later, in the training loop...

_, loss_actor_val = sess.run([training_op_actor, loss_actor],
                             feed_dict={action_placeholder: np.squeeze(action),
                                        state_placeholder: scale_state(state),
                                        delta_placeholder: td_error})

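In this example, sess.run() evaluates the whole graph, including running inference, capturing gradients, and adjusting weights. So to pass whatever values you need into the loss function/gradient computation, you just feed the necessary values into the computation graph.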
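
To make the arithmetic behind loss_actor concrete, here is a minimal framework-free sketch of the same expression for a single scalar action. It assumes a one-dimensional normal distribution; `normal_pdf` and `actor_loss` are hypothetical helper names, and the numbers are made up for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def actor_loss(action, mu, sigma, td_error, eps=1e-5):
    """Mirrors -log(norm_dist.prob(action) + 1e-5) * delta for one sample."""
    return -math.log(normal_pdf(action, mu, sigma) + eps) * td_error


# Same action and distribution, opposite signs of the TD error.
loss_good = actor_loss(action=0.5, mu=0.0, sigma=1.0, td_error=2.0)
loss_bad = actor_loss(action=0.5, mu=0.0, sigma=1.0, td_error=-2.0)

# With a positive TD error, a mean closer to the taken action gives a lower loss.
loss_closer = actor_loss(action=0.5, mu=0.5, sigma=1.0, td_error=2.0)
print(loss_good, loss_bad, loss_closer)
```

With a positive TD error, minimizing this loss raises the probability of the action that was taken; with a negative TD error it lowers it.
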

Keras is a bit more formal about what a loss function should look like:

loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature scalar_loss = fn(y_true, y_pred). If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.

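Keras will run inference (the forward pass) for you and pass the output to the loss function. The loss function is supposed to do some extra computation on the predicted value and the y_true label and return the result. The whole process is tracked for the purposes of gradient computation.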

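While this is convenient for traditional training, it is a bit restrictive when we want to pass in some extra data, such as the TD error. It is possible to work around this by pushing all the extra data into y_true and pulling it apart inside the loss function (I found this trick somewhere on the web, but unfortunately lost the link to the source).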

In the end I rewrote the above as follows:

def loss(y_true, y_pred):
    action_true = y_true[:, :n_outputs]
    advantage = y_true[:, n_outputs:]
    return -tfc.log(y_pred.prob(action_true) + 1e-5) * advantage

# Below, in the training loop...

# A trick to pass the TD error *and* the actual action to the loss function:
# join them into one tensor and split them apart inside the loss function.
annotated_action = tf.concat([action, td_error], axis=1)
actor_model.train_on_batch([scale_state(state)], [annotated_action])
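
The packing and unpacking can be checked without TensorFlow. The sketch below is only an illustration under assumptions of my own: the batch is a plain list of rows, the action is one-dimensional, and y_pred holds (mu, sigma) pairs instead of a distribution object. `N_OUTPUTS`, `pack_targets` and `actor_loss` are hypothetical names.

```python
import math

N_OUTPUTS = 1  # assumed action dimensionality (plays the role of n_outputs)


def pack_targets(actions, td_errors):
    """Join each action row with its TD error row, like tf.concat(..., axis=1)."""
    return [list(a) + list(d) for a, d in zip(actions, td_errors)]


def actor_loss(y_true, y_pred, eps=1e-5):
    """Split y_true back into action and advantage, then compute per-sample losses.

    y_true rows look like [action, advantage]; y_pred rows are (mu, sigma).
    """
    losses = []
    for target, (mu, sigma) in zip(y_true, y_pred):
        action = target[:N_OUTPUTS][0]      # first N_OUTPUTS columns
        advantage = target[N_OUTPUTS:][0]   # remaining column
        z = (action - mu) / sigma
        prob = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
        losses.append(-math.log(prob + eps) * advantage)
    return losses


y_true = pack_targets(actions=[[0.5], [-1.0]], td_errors=[[2.0], [-0.3]])
y_pred = [(0.0, 1.0), (0.0, 1.0)]  # made-up network outputs
losses = actor_loss(y_true, y_pred)
print(y_true, losses)
```

The point of the design is that Keras only ever hands the loss two arguments, so anything extra has to travel in the columns of y_true.
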

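The actor's loss function involves computing the probability of an action under a given normal distribution. How can I make sure that the mu and sigma of the normal distribution during loss computation actually match the ones used during prediction?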

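When I asked this question, I did not understand well enough how the TF computation graph works. So the answer is simple: every time sess.run() is called, it has to compute the whole graph from scratch. As long as the graph inputs (e.g. the observed state) and the NN weights are the same (or similar), the parameters of the distribution will be the same (or similar).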
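
A toy illustration of that point, assuming a hypothetical one-neuron "network" in plain Python (`policy_params` and all weight values are made up): the distribution parameters are a pure function of the state and the weights, so recomputing them during training gives the same mu and sigma as during prediction until the weights change.

```python
import math


def policy_params(state, weights):
    """Toy linear policy head: returns (mu, sigma) for a scalar state."""
    w_mu, b_mu, w_sigma, b_sigma = weights
    mu = w_mu * state + b_mu
    # Softplus keeps sigma strictly positive.
    sigma = math.log1p(math.exp(w_sigma * state + b_sigma)) + 1e-5
    return mu, sigma


weights = (0.8, 0.1, -0.5, 0.2)
state = 0.3

at_prediction = policy_params(state, weights)  # when the action is sampled
at_training = policy_params(state, weights)    # recomputed for the loss
after_update = policy_params(state, (0.9, 0.1, -0.5, 0.2))  # weights changed

print(at_prediction == at_training, after_update == at_prediction)
```
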

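The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems wrong, but I am not sure how.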

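What is wrong is the assumption that "the actor's loss function doesn't care about y_pred" :) The actor's loss function involves norm_dist (i.e. the action probability distribution), which is effectively the analog of y_pred in this case.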
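
One way to convince yourself of this is to check that the loss changes when the predicted mean changes. The sketch below is a plain-Python check with hypothetical names and numbers; it leaves out the 1e-5 offset so that the finite-difference gradient can be compared with the closed form -(action - mu) / sigma^2 * advantage.

```python
import math


def actor_loss(action, mu, sigma, advantage):
    """-log N(action; mu, sigma) * advantage, without the small epsilon."""
    z = (action - mu) / sigma
    prob = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return -math.log(prob) * advantage


def numeric_grad_mu(action, mu, sigma, advantage, h=1e-6):
    """Central finite difference of the loss with respect to mu."""
    up = actor_loss(action, mu + h, sigma, advantage)
    down = actor_loss(action, mu - h, sigma, advantage)
    return (up - down) / (2.0 * h)


action, mu, sigma, advantage = 0.5, 0.0, 1.0, 2.0
grad = numeric_grad_mu(action, mu, sigma, advantage)
analytic = -(action - mu) / sigma ** 2 * advantage
print(grad, analytic)
```

The gradient with respect to mu is non-zero, so the network output does take part in the loss; its negative sign here means gradient descent moves mu toward the taken action when the advantage is positive.
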
