How are n-dimensional state vectors represented in Q-learning?

Using this code:

import gym
import numpy as np
import time

"""
SARSA on policy learning python implementation.
This is a python implementation of the SARSA algorithm in the Sutton and Barto's book on
RL. It's called SARSA because - (state, action, reward, state, action). The only difference
between SARSA and Qlearning is that SARSA takes the next action based on the current policy
while qlearning takes the action with maximum utility of next state.
Using the simplest gym environment for brevity: https://gym.openai.com/envs/FrozenLake-v0/
"""

def init_q(s, a, type="ones"):
    """
    @param s the number of states
    @param a the number of actions
    @param type random, ones or zeros for the initialization
    """
    if type == "ones":
        return np.ones((s, a))
    elif type == "random":
        return np.random.random((s, a))
    elif type == "zeros":
        return np.zeros((s, a))


def epsilon_greedy(Q, epsilon, n_actions, s, train=False):
    """
    @param Q Q values state x action -> value
    @param epsilon for exploration
    @param n_actions number of available actions
    @param s the current state
    @param train if true then no random actions are selected
    """
    if train or np.random.rand() < epsilon:
        # exploit: choose the action with the highest Q value for state s
        action = np.argmax(Q[s, :])
    else:
        # explore: choose a random action
        action = np.random.randint(0, n_actions)
    return action

def sarsa(alpha, gamma, epsilon, episodes, max_steps, n_tests, render=True, test=False):
    """
    @param alpha learning rate
    @param gamma discount factor
    @param epsilon for exploration
    @param episodes number of training episodes
    @param max_steps max number of steps in each episode
    @param n_tests number of test episodes
    @param render if true, render the environment and print progress
    @param test if true, run test_agent after training
    """
    env = gym.make('Taxi-v3')
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = init_q(n_states, n_actions, type="ones")
    print('Q shape:' , Q.shape)

    timestep_reward = []
    for episode in range(episodes):
        print(f"Episode: {episode}")
        total_reward = 0
        s = env.reset()
        print('s:' , s)
        a = epsilon_greedy(Q, epsilon, n_actions, s)
        t = 0
        done = False
        while t < max_steps:
            if render:
                env.render()
            t += 1
            s_, reward, done, info = env.step(a)
            total_reward += reward
            a_ = epsilon_greedy(Q, epsilon, n_actions, s_)
            if done:
                Q[s, a] += alpha * ( reward  - Q[s, a] )
            else:
                Q[s, a] += alpha * ( reward + (gamma * Q[s_, a_] ) - Q[s, a] )
            s, a = s_, a_
            if done:
                if render:
                    print(f"This episode took {t} timesteps and reward {total_reward}")
                timestep_reward.append(total_reward)
                break
#             print('Updated Q values:' , Q)
    if render:
        print(f"Here are the Q values:\n{Q}\nTesting now:")
    if test:
        test_agent(Q, env, n_tests, n_actions)
    return timestep_reward

def test_agent(Q, env, n_tests, n_actions, delay=0.1):
    for test in range(n_tests):
        print(f"Test #{test}")
        s = env.reset()
        done = False
        epsilon = 0
        total_reward = 0
        while True:
            time.sleep(delay)
            env.render()
            a = epsilon_greedy(Q, epsilon, n_actions, s, train=True)
            print(f"Chose action {a} for state {s}")
            s, reward, done, info = env.step(a)
            total_reward += reward
            if done:  
                print(f"Episode reward: {total_reward}")
                time.sleep(1)
                break


if __name__ == "__main__":
    alpha = 0.4
    gamma = 0.999
    epsilon = 0.9
    episodes = 200
    max_steps = 20
    n_tests = 20
    timestep_reward = sarsa(alpha, gamma, epsilon, episodes, max_steps, n_tests)
    print(timestep_reward)

From:

https://towardsdatascience.com/reinforcement-learning-temporal-difference-sarsa-q-learning-expected-sarsa-on-python-9fecfda7467e

The sample Q table generated is:

[[ 1.          1.          1.          1.          1.          1.        ]
 [ 0.5996      0.5996      0.5996      0.35936     0.5996      1.        ]
 [ 0.19936016  0.35936     0.10336026  0.35936     0.35936    -5.56063984]
 ...
 [ 0.35936     0.5996      0.35936     0.5996      1.          1.        ]
 [ 1.          0.5996      1.          1.          1.          1.        ]
 [ 0.35936     0.5996      1.          1.          1.          1.        ]]

The columns represent the actions and the rows represent the corresponding states.

Can the states be represented as vectors? The Q-table cells do not contain vectors of size > 1, so how should such states be represented? For example, if I am in state [2], can that be represented as an n-dimensional vector?

In other words, if Q[1,3] = 4, can the Q state 1 with action 3 be represented as vector[1,3,2,12,3]? If so, would the state_number->state_attributes mapping be stored in a separate lookup table?

Yes, states can be represented by anything you want, including vectors of arbitrary length. Note, however, that if you are using a tabular version of Q-learning (or SARSA, as in this case), you must have a discrete set of states. Therefore, you need a way to map your state representation (for example, a vector of potentially continuous values) to a set of discrete states.
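
For instance, here is a minimal sketch of such a mapping (the bin edges, value ranges, and the discretize name below are made up for illustration, not taken from your code): bucket each component of the vector with np.digitize and fold the bucket indices into a single integer that can be used as a row index of the Q-table.

import numpy as np

# Illustrative discretization: bucket each component of a continuous
# state vector and combine the bucket indices into one integer state id.
bins = [
    np.linspace(0.0, 1.0, 4),   # assumed bin edges for component 0
    np.linspace(-2.0, 2.0, 4),  # assumed bin edges for component 1
]

def discretize(state_vector):
    # np.digitize gives the bucket index of each component
    idx = [np.digitize(x, b) for x, b in zip(state_vector, bins)]
    # combine the per-component indices into a single integer (mixed radix)
    state_id = 0
    for i, b in zip(idx, bins):
        state_id = state_id * (len(b) + 1) + i
    return state_id

s = discretize([0.3, -1.5])  # an integer usable as a row index of Q

The Q-table would then need one row per possible state id (here 5 * 5 = 25 rows).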

Extending the example you gave, imagine you have three states represented by vectors:

s0 = [1, 3, 2, 12, 3]
s1 = [3, 1, 2, 2, 23]
s2 = [2, 12, 3, 2, 1]

In the end, you only have 3 states, no matter how they are represented. You can map the vectors to the states s0, s1, s2 and use a simple Q-table. Alternatively, you can use some other data structure that uses the vector representation (i.e., [1, 3, 2, 12, 3]) as an index, as sketched below.
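
As a rough sketch of those two options (the sizes and variable names below are illustrative, not taken from your code), you can keep a lookup table from the vector, converted to a hashable tuple, to a row index, or key the Q values directly by the tuple; note that for selecting actions you need the state_attributes -> state_number direction of the mapping:

import numpy as np

n_actions = 6

# Option 1: a lookup table from state vector (as a tuple) to a Q-table row index
state_index = {
    (1, 3, 2, 12, 3): 0,   # s0
    (3, 1, 2, 2, 23): 1,   # s1
    (2, 12, 3, 2, 1): 2,   # s2
}
Q = np.ones((len(state_index), n_actions))
s = state_index[(1, 3, 2, 12, 3)]       # row for s0
best_action = np.argmax(Q[s, :])

# Option 2: key the Q values directly by the state tuple
Q_dict = {sv: np.ones(n_actions) for sv in state_index}
best_action = np.argmax(Q_dict[(2, 12, 3, 2, 1)])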

On the other hand, if your state space is continuous and you don't want to discretize it, then you can use a function approximator (e.g., a neural network) to store Q values instead of a table. But that is another topic (more info in chapter 8 of the Sutton & Barto RL book).
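
For completeness, here is a minimal sketch of that idea using a linear approximator in place of a full neural network (all names, sizes, and hyperparameters are illustrative); the update is semi-gradient SARSA applied directly to the vector state:

import numpy as np

n_features, n_actions = 5, 6
w = np.zeros((n_actions, n_features))   # one weight vector per action

def q_value(state_vec, action):
    # Q(s, a) is approximated as a linear function of the state vector
    return w[action] @ np.asarray(state_vec, dtype=float)

def sarsa_update(s_vec, a, reward, s_next_vec, a_next,
                 alpha=0.1, gamma=0.99, done=False):
    # semi-gradient SARSA: move Q(s, a) toward the TD target
    target = reward if done else reward + gamma * q_value(s_next_vec, a_next)
    td_error = target - q_value(s_vec, a)
    w[a] += alpha * td_error * np.asarray(s_vec, dtype=float)

sarsa_update([1, 3, 2, 12, 3], a=2, reward=1.0,
             s_next_vec=[3, 1, 2, 2, 23], a_next=4)

A neural network would replace the linear q_value and its gradient step, but the structure of the update stays the same.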