Q-learning vs temporal-difference vs model-based reinforcement learning

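I'm taking a course called "Intelligent Machines" at university. We were introduced to 3 methods of reinforcement learning, and with them we were given the intuition of when to use each. I quote: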

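Q-Learning - Best when the MDP can't be solved.
Temporal Difference Learning - Best when the MDP is known or can be learned but can't be solved.
Model-based - Best when the MDP can't be learned.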

Are there any good examples that explain when to choose one method over another?

Temporal difference (TD) is an approach to learning how to predict a quantity that depends on future values of a given signal. It can be used to learn both the V-function and the Q-function, whereas Q-learning is a specific TD algorithm used to learn the Q-function. As stated by Don Reba, you need the Q-function to perform an action (e.g., following an epsilon-greedy policy). If you have only the V-function, you can still derive the Q-function, provided you know the transition model, by iterating over all the possible next states and choosing the action which leads you to the state with the highest V-value. For examples and more insights, I recommend the classic book by Sutton and Barto.
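To make the distinction concrete, here is a minimal tabular sketch. It is not taken from the book or from any library: the 5-state chain environment and the names `step`, `td0_values`, `q_learning` and `greedy_from_v` are all made up for illustration. It shows TD(0) learning a V-function for a fixed random policy, Q-learning learning a Q-function, and a one-step lookahead that picks an action from a V-function, which only works because the model (`step`) is known.

```python
import random

# Hypothetical toy MDP: a chain of states 0..4. Action 0 moves left, action 1
# moves right. Entering state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1


def step(state, action):
    """Toy environment, also used as the known model: (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def td0_values(episodes=500, seed=0):
    """TD(0) prediction: estimate V for a uniformly random policy."""
    rng = random.Random(seed)
    V = [0.0] * N_STATES
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # TD target bootstraps from the current estimate of the next state.
            target = reward + (0.0 if done else GAMMA * V[next_state])
            V[state] += ALPHA * (target - V[state])
            state = next_state
    return V


def q_learning(episodes=500, seed=0):
    """Q-learning: a TD method whose target uses the max over next actions."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:  # explore
                action = rng.choice(ACTIONS)
            else:  # exploit, breaking ties at random
                best = max(Q[state])
                action = rng.choice([a for a in ACTIONS if Q[state][a] == best])
            next_state, reward, done = step(state, action)
            target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
    return Q


def greedy_from_v(V, state):
    """Pick an action from V by one-step lookahead; this needs the model."""
    def backup(action):
        next_state, reward, done = step(state, action)
        return reward + (0.0 if done else GAMMA * V[next_state])
    return max(ACTIONS, key=backup)
```

With a Q-table you can act directly via `max(ACTIONS, key=lambda a: Q[state][a])`; with only a V-table you have to go through something like `greedy_from_v`, which calls the model for every candidate action.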

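In model-free RL you do not learn the state-transition function (the model) and can rely only on samples. However, you might also be interested in learning it, for example because you cannot collect many samples and want to generate some virtual ones. In this case we talk about model-based RL. Model-based RL is quite common in robotics, where you cannot perform many real trials or the robot will break. This is a good survey with many examples (but it only talks about policy search algorithms). For another example have a look at this paper. Here the authors learn, along with a policy, a Gaussian process to approximate the forward model of the robot, in order to simulate trajectories and to reduce the number of real robot interactions.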
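The "virtual samples" idea can be sketched with a tabular Dyna-Q-style loop. Note that this swaps in a plain lookup table as the learned model, not the Gaussian process from the paper mentioned above, and the chain environment and the names `step` and `dyna_q` are hypothetical. Each real transition is stored in the model, and the model is then replayed `planning_steps` times to get extra updates without touching the real system.

```python
import random

# Hypothetical toy MDP: a chain of states 0..4. Action 0 moves left, action 1
# moves right. Entering state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1


def step(state, action):
    """Stand-in for the real system: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def dyna_q(real_steps, planning_steps, seed=0):
    """Q-learning on real samples plus replayed samples from a learned model."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}  # learned model: (state, action) -> (next_state, reward, done)

    def update(s, a, r, s2, done):
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])

    state = 0
    for _ in range(real_steps):
        if rng.random() < EPSILON:  # explore
            action = rng.choice(ACTIONS)
        else:  # exploit, breaking ties at random
            best = max(Q[state])
            action = rng.choice([a for a in ACTIONS if Q[state][a] == best])
        next_state, reward, done = step(state, action)  # one real interaction
        update(state, action, reward, next_state, done)
        model[(state, action)] = (next_state, reward, done)
        # Planning: replay virtual samples drawn from the learned model.
        for _ in range(planning_steps):
            (s, a), (s2, r, d) = rng.choice(list(model.items()))
            update(s, a, r, s2, d)
        state = 0 if done else next_state
    return Q
```

For the same budget of real interactions, e.g. `dyna_q(300, 20)` versus `dyna_q(300, 0)`, the version with planning gets many more value updates, which is the reason to learn a model when real samples are expensive. In this toy case the environment is deterministic, so a lookup table is an exact model; with a noisy or continuous system you would need a function approximator, and model errors would bias the virtual samples.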

machine-learning

reinforcement-learning

temporal-difference

q-learning