Choose function for On-Policy prediction with approximation

Question

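I am currently reading Sutton's introduction to reinforcement learning. After reaching Chapter 10 (On-Policy prediction with approximation), I am now wondering how to choose the form of the function q for which the optimal weights w are to be approximated.

I am referring to the first line of the pseudocode from Sutton, which takes a differentiable function as input: how do I choose a good differentiable function? Are there any standard strategies for choosing one?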

Answer 1

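You can choose any function approximator that is differentiable. Two commonly used classes of value function approximators are:

Linear function approximators: linear combinations of features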

 For approximating Q (the action-value)
 1. Find features that are functions of states and actions.
 2. Represent q as a weighted combination of these features.

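$$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i \, x_i(s, a)$$

where $\mathbf{x}(s, a)$ is a vector in $\mathbb{R}^d$ with $i$-th component given by $x_i(s, a)$, and $\mathbf{w}$ is the weight vector in $\mathbb{R}^d$ whose $i$-th component is given by $w_i$.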
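The two steps above can be sketched in a few lines of NumPy. This is only an illustration: the one-hot feature map `features` and the sizes `n_states=4`, `n_actions=2` are assumptions made for the example, not something prescribed by the book.

```python
import numpy as np

def features(state, action, n_states=4, n_actions=2):
    """Hypothetical feature map x(s, a): one-hot over (state, action) pairs."""
    x = np.zeros(n_states * n_actions)
    x[state * n_actions + action] = 1.0
    return x

def q_hat(state, action, w):
    """Linear action-value estimate: the dot product of x(s, a) and w."""
    return features(state, action) @ w

def grad_q_hat(state, action, w):
    """For a linear approximator the gradient w.r.t. w is simply x(s, a)."""
    return features(state, action)

w = np.zeros(8)      # one weight per (state, action) pair
w[2 * 2 + 1] = 0.5   # set the weight for state 2, action 1
print(q_hat(2, 1, w), q_hat(0, 0, w))
```

With one-hot features this reduces to a lookup table; richer features (tile coding, polynomials, and so on) let the same code generalize across states.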

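Neural networks

Represent $\hat{q}(s, a, \mathbf{w})$ using a neural network. You can approximate using either the action-in type (left of the figure) or the action-out type (right of the figure). The difference is that the network can either take representations of both the state and the action as input and produce a single value (the Q-value) as output, or take only the representation of the state s as input and provide one output value for each action a in the action space (this type is easier to realize if the action space is discrete and finite).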
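To make the action-out variant concrete, here is a minimal NumPy sketch of a one-hidden-layer network that maps a state vector to one Q-value per action. The layer sizes, the `tanh` activation, and the name `action_out_q` are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_state_features, n_hidden, n_actions = 4, 8, 3  # assumed sizes

# Randomly initialised weights for a one-hidden-layer network.
W1 = rng.normal(0.0, 0.1, size=(n_hidden, n_state_features))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, size=(n_actions, n_hidden))
b2 = np.zeros(n_actions)

def action_out_q(state_vec):
    """Return one Q-value estimate per action for the given state vector."""
    h = np.tanh(W1 @ state_vec + b1)
    return W2 @ h + b2

state_vec = np.array([1.0, 0.0, 0.0, 0.0])
q_values = action_out_q(state_vec)        # shape (n_actions,)
greedy_action = int(np.argmax(q_values))  # a single forward pass ranks all actions
```

The practical appeal of this layout is that a greedy or epsilon-greedy choice needs only one forward pass, whereas an action-in network needs one pass per candidate action.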

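Using the first type (action-in) for the example, since it is close to the example in the linear case, you could create a Q-value approximator using a neural network with the following approach: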
```
  Represent the state-action value as a normalized vector
  (or as a one-hot vector representing the state and action)
  1. Input layer : Size= number of inputs
  2. `n` hidden layers with `m` neurons
  3. Output layer: single output neuron
  Sigmoid activation function.
  Update weights using gradient descent as per the semi-gradient Sarsa algorithm.
```
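The recipe above could look roughly like the following NumPy sketch. It is not a reference implementation: the class name `ActionInQNet`, the one-hot `encode` helper, the single hidden layer, and the linear output unit (used here so that Q-values are not confined to (0, 1)) are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(state, action, n_states=3, n_actions=2):
    """Hypothetical input encoding: one-hot state concatenated with one-hot action."""
    x = np.zeros(n_states + n_actions)
    x[state] = 1.0
    x[n_states + action] = 1.0
    return x

class ActionInQNet:
    """Action-in network: (state, action) vector in, single Q-value out."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, size=(n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, size=n_hidden)
        self.b2 = 0.0

    def forward(self, x):
        h = sigmoid(self.W1 @ x + self.b1)       # sigmoid hidden layer
        return float(self.W2 @ h + self.b2), h   # linear output unit (assumption)

    def q(self, x):
        return self.forward(x)[0]

    def sarsa_update(self, x, reward, x_next, done, alpha=0.1, gamma=0.9):
        """One semi-gradient Sarsa step: the target is treated as a constant."""
        q, h = self.forward(x)
        target = reward if done else reward + gamma * self.q(x_next)
        delta = target - q
        # Gradient of q w.r.t. the hidden pre-activations, using the current W2.
        dpre = self.W2 * h * (1.0 - h)
        self.W1 += alpha * delta * np.outer(dpre, x)
        self.b1 += alpha * delta * dpre
        self.W2 += alpha * delta * h
        self.b2 += alpha * delta
        return delta

net = ActionInQNet(n_inputs=5, n_hidden=8)
x = encode(state=0, action=1)
for _ in range(300):
    # Repeatedly present a terminal transition with reward 1.
    net.sarsa_update(x, reward=1.0, x_next=x, done=True)
print(net.q(x))
```

In a real agent the `(x, reward, x_next, done)` tuples would come from interacting with the environment under an epsilon-greedy policy rather than from a fixed transition.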
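You could also use the visuals (if available) directly as the input and use convolutional layers, as in the DQN paper. But read the note below regarding convergence and the additional tricks needed to stabilize such non-linear approximator based methods.

Graphically, the function approximator looks like this: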

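Note that $\varphi$ is an elementary function and $x_i$ is used to represent elements of the state-action vector. You can use any elementary function in place of $\varphi$. Some common ones are linear regressors, Radial Basis Functions, etc.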
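As one possible instance of the Radial Basis Functions mentioned above, the sketch below builds Gaussian RBF features over a state-action vector and feeds them to a linear weight vector. The function name `rbf_features`, the grid of `centers`, and the width `sigma` are illustrative assumptions.

```python
import numpy as np

def rbf_features(z, centers, sigma=0.5):
    """Gaussian RBF features: exp(-||z - c_i||^2 / (2 * sigma^2)) for each center c_i."""
    z = np.asarray(z, dtype=float)
    sq_dist = np.sum((centers - z) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Assumed grid of centers over a 1-D state in [0, 1] and a binary action.
centers = np.array([[s, a] for s in (0.0, 0.5, 1.0) for a in (0.0, 1.0)])
w = np.zeros(len(centers))     # one weight per basis function

z = np.array([0.5, 1.0])       # state-action vector (state=0.5, action=1)
x = rbf_features(z, centers)   # feature vector
q = x @ w                      # the Q-value estimate is still linear in w
```

Because the estimate remains linear in the weights, the convergence properties of linear function approximation still apply even though the features themselves are non-linear in the state and action.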

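What counts as a good differentiable function depends on the context. In reinforcement learning settings, however, convergence properties and error bounds are important. The Episodic semi-gradient Sarsa algorithm discussed in the book has convergence properties similar to those of TD(0).

Since you specifically asked about on-policy prediction, using a linear function approximator is advisable because it is guaranteed to converge. The following are some other properties that make linear function approximators suitable: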

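The error surface under the mean squared error objective is quadratic with a single minimum. This makes it a sure-shot solution, since gradient descent is guaranteed to find that minimum, which is the global optimum.
The error bound (as proved by Tsitsiklis & Van Roy, 1997, for the general case of TD(lambda)) is:

$$\overline{VE}(\mathbf{w}_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} \, \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$$

where $\overline{VE}$ denotes the mean squared value error. This means that the asymptotic error will be no more than $\frac{1 - \gamma\lambda}{1 - \gamma}$ times the smallest possible error, where $\gamma$ is the discount factor.
The gradient is simple to calculate!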
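To get a feel for the bound, the multiplicative factor for TD(lambda) with linear function approximation is $(1 - \gamma\lambda)/(1 - \gamma)$. The small helper below (a hypothetical name, written for this example) evaluates it for a few assumed values of $\gamma$ and $\lambda$.

```python
def td_lambda_bound_factor(gamma, lam):
    """Factor multiplying the smallest possible error in the TD(lambda) bound."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return (1.0 - gamma * lam) / (1.0 - gamma)

for lam in (0.0, 0.5, 1.0):
    print(lam, td_lambda_bound_factor(0.9, lam))
```

With $\gamma = 0.9$ the factor is about 10 for TD(0) and shrinks to 1 as $\lambda$ approaches 1, so the bound is loosest for one-step TD and for discount factors close to 1.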

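However, using a non-linear approximator (such as a (deep) neural network) does not inherently guarantee convergence. Gradient TD methods, which update using the true gradient of the projected Bellman error rather than the semi-gradient used in the semi-gradient Sarsa algorithm, are known to provide convergence even with non-linear function approximators (even for off-policy prediction) if certain conditions are met.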
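For orientation, here is a sketch of one Gradient TD variant, TDC (TD with gradient correction), restricted to the linear, on-policy case (importance ratio fixed to 1). The non-linear versions need additional terms that are not shown here. The function name `tdc_update` and the step sizes are assumptions for the example.

```python
import numpy as np

def tdc_update(w, v, x, reward, x_next, alpha=0.05, beta=0.1, gamma=0.9):
    """One linear TDC step (on-policy, importance ratio assumed to be 1).

    w: primary weights, v: secondary weights that estimate the expected
    TD error as a linear function of the features.
    """
    delta = reward + gamma * (w @ x_next) - (w @ x)              # TD error
    w_new = w + alpha * (delta * x - gamma * x_next * (x @ v))   # corrected update
    v_new = v + beta * (delta - x @ v) * x                       # track expected TD error
    return w_new, v_new, delta

w = np.zeros(2)
v = np.zeros(2)
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w, v, delta = tdc_update(w, v, x, reward=1.0, x_next=x_next)
```

While `v` is zero the primary update coincides with the ordinary semi-gradient TD step; the correction term `gamma * x_next * (x @ v)` is what distinguishes it once `v` has been learned.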

reinforcement-learning

approximation