Exploration and exploitation in Q-learning

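In the Q-learning algorithm, the choice of action depends on the current state and the values in the Q-matrix. I would like to know whether these Q-values are updated only during exploration steps, or whether they also change during exploitation steps.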

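If you read the Q-learning algorithm pseudocode, for example the version in the Sutton & Barto book, it is clear that the Q-values are always updated, regardless of whether the chosen action was exploratory or not.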
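For reference, the standard one-step Q-learning update can be written as follows. The symbols are the usual ones rather than anything taken from the post: step size $\alpha$, discount factor $\gamma$, reward $r$, and next state $s'$.

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Nothing in this rule depends on how $a$ was selected. The target takes the maximum over next actions instead of following the behaviour policy, so the same update applies whether $a$ was chosen greedily or at random.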

Note that the line "Choose a from s using policy derived from Q (e.g., epsilon-greedy)" means that the action is sometimes exploratory.
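To make this concrete, here is a minimal sketch of tabular Q-learning with an epsilon-greedy policy. It is not the book's code: the five-state chain environment, the `step` helper, the hyperparameter values, and the `updates` counter are all assumptions made up for illustration. The counter only records which branch selected each action, to show that the update runs in both cases.

```python
import random


def q_learning(n_states=5, n_actions=2, episodes=200,
               alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain (illustrative only)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    updates = {"explore": 0, "exploit": 0}

    def step(s, a):
        # Hypothetical dynamics: action 1 moves right, action 0 moves left.
        # Reaching the rightmost state ends the episode with reward 1.
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s_next == n_states - 1
        return s_next, (1.0 if done else 0.0), done

    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            # Choose a from s using an epsilon-greedy policy derived from Q.
            explore = rng.random() < epsilon
            if explore:
                a = rng.randrange(n_actions)  # exploratory action
            else:
                best = max(Q[s])  # greedy action, ties broken at random
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])

            s_next, r, done = step(s, a)

            # The update is applied on every step, whichever branch chose a.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            updates["explore" if explore else "exploit"] += 1

            s = s_next
            if done:
                break
    return Q, updates


Q, updates = q_learning()
print(updates)  # both counters are non-zero
```

The update line sits outside the `if explore` / `else` branches, which is the point of the answer above: exploration only affects which action is taken, not whether Q is updated.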