井字游戏的 Q 学习算法

Question

我不明白如何更新井字游戏的 Q 值。我阅读了所有相关内容，但我无法想象如何做到这一点。我看到Q值是在游戏结束时更新的，但我一直不明白是否每个动作都有Q值？

Answer 1

每个状态-动作对都有一个 Q 值。您执行每个操作后更新一个 Q 值。更准确地说，如果从状态 s1 应用操作 a1 让你进入状态 s2 并给你带来一些奖励 r，那么你更新 Q(s1, a1) 如下：

Q(s1, a1) = Q(s1, a1) + learning_rate * (r + discount_factor * max Q(s2, _) - Q(s1, a1))

在许多游戏中，例如井字棋，您要到游戏结束才能获得奖励，这就是为什么您必须运行算法通过几集。这就是关于最终状态效用的信息如何传播到其他状态。

Answer 2

标准 Q 学习算法的问题在于，将值从最后一步传播到第一步所花费的时间太长，因为您只知道游戏结束时的结果。

因此应该修改Q Learning算法。以下论文提供了一些可能的修改细节：

游戏结束后给予非负奖励（平局除外），然后Q更新不会在每个动作步骤进行（这不会改变任何东西），但是仅在比赛结束后
Q更新是通过从最后一步传播它的新值来执行的倒退到第一步
由于双人游戏的轮流性质，引入了另一个更新公式，该公式也考虑了对手的观点

摘要：

This paper reports our experiment on applying Q Learning algorithm for learning to play Tic-tac-toe. The original algorithm is modified by updating the Q value only when the game terminates, propagating the update process from the final move backward to the first move, and incorporating a new update rule. We evaluate the agent performance using full-board and partial-board representations. In this evaluation, the agent plays the tic-tac-toe game against human players. The evaluation results show that the performance of modified Q Learning algorithm with partial-board representation is comparable to that of human players.

Learning to Play Tic-Tac-Toe (2009) by Dwi H. Widyantoro & Yus G. Vembrina

（不幸的是，它在付费专区后面。您可以访问 IEEE 档案，或者您可以要求作者在 researchgate 上提供一份副本：https://www.researchgate.net/publication/251899151_Learning_to_play_Tic-tac-toe）

井字游戏的 Q 学习算法

Q Learning Algorithm for Tic Tac Toe

artificial-intelligence

machine-learning

reinforcement-learning

tic-tac-toe

q-learning