使用 Temporal difference learning 有什么意义呢？

What's the point of using Temporal difference learning at all?

据我所知，对于一个特定的策略\pi，时间差分学习让我们计算出遵循该策略\pi的期望值，但是知道一个特定的策略有什么意义？

我们不应该尝试为给定环境寻找最优策略吗？使用时间差异学习做一个特定的 \pi 有什么意义？

正如您所说，仅找到给定策略的价值函数在一般情况下并不是很有用，目标是找到最优策略。然而，一些经典算法，如 SARSA 或 Q-learning，可以看作是 generalized policy iteration 的特例，其中最困难的部分是找到策略的价值函数。一旦你知道了价值函数，就很容易找到一个更好的策略，然后再次找到最近计算的策略的价值函数，等等。这个过程，给定一些条件，收敛到最优策略。

总而言之，temporal difference learning 是其他算法中寻找最优策略的关键步骤。

使用 Temporal difference learning 有什么意义呢？

What's the point of using Temporal difference learning at all?

reinforcement-learning

temporal-difference