OpenAI Gym - Maze - Using Q learning- "ValueError: dir cannot be 0. The only valid dirs are dict_keys(['N', 'E', 'S', 'W'])."
I am trying to train an agent to solve a maze using Q-learning.
I created the environment with:
import gym
import gym_maze
import numpy as np
env = gym.make("maze-v0")
Since the states are [x, y] coordinates and I want a 2D Q-table, I created a dictionary that maps each state to a single value:
states_dic = {}
count = 0
for i in range(5):
    for j in range(5):
        states_dic[i, j] = count
        count += 1
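Because the inner loop runs over j for each i, the mapping is just row-major order: for this 5x5 maze, states_dic[(x, y)] equals x * 5 + y. A quick sanity check of the dictionary built above (assuming the same 5x5 layout):

# sanity check of the mapping built above (5x5 maze assumed)
for (x, y), idx in states_dic.items():
    assert idx == x * 5 + y
print(states_dic[2, 3])   # 13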
Then I created the Q-table:
n_actions = env.action_space.n
#Initialize the Q-table to 0
Q_table = np.zeros((len(states_dic),n_actions))
print(Q_table)
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
Some variables:
# number of episodes we will run
n_episodes = 10000
# maximum number of iterations per episode
max_iter_episode = 100
# initialize the exploration probability to 1
exploration_proba = 1
# exploration decay rate for exponential decrease
exploration_decreasing_decay = 0.001
# minimum exploration probability
min_exploration_proba = 0.01
# discount factor
gamma = 0.99
# learning rate
lr = 0.1
rewards_per_episode = list()
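With these constants the exploration probability follows exp(-0.001 * e), so it reaches the 0.01 floor after roughly ln(100) / 0.001 ≈ 4600 episodes. A quick check of the schedule, reusing the constants defined above:

# exploration schedule implied by the constants above
for e in (0, 1000, 2500, 4605, 9999):
    print(e, max(min_exploration_proba, np.exp(-exploration_decreasing_decay * e)))
# 0 -> 1.0, 1000 -> ~0.37, 2500 -> ~0.08, 4605 -> 0.01, 9999 -> 0.01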
But when I try to run the Q-learning algorithm, I get the error in the title.
# we iterate over episodes
for e in range(n_episodes):
    # we initialize the first state of the episode
    current_state = env.reset()
    done = False

    # sum the rewards that the agent gets from the environment
    total_episode_reward = 0

    for i in range(max_iter_episode):
        if np.random.uniform(0, 1) < exploration_proba:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q_table[current_state,:])

        next_state, reward, done, _ = env.step(action)

        current_coordinate_x = int(current_state[0])
        current_coordinate_y = int(current_state[1])
        next_coordinate_x = int(next_state[0])
        next_coordinate_y = int(next_state[1])

        # update Q-table using the Q-learning iteration
        current_Q_table_coordinates = states_dic[current_coordinate_x, current_coordinate_y]
        next_Q_table_coordinates = states_dic[next_coordinate_x, next_coordinate_y]
        Q_table[current_Q_table_coordinates, action] = (1 - lr) * Q_table[current_Q_table_coordinates, action] + lr * (reward + gamma * max(Q_table[next_Q_table_coordinates,:]))

        total_episode_reward = total_episode_reward + reward
        # If the episode is finished, we leave the for loop
        if done:
            break
        current_state = next_state

    # We update the exploration proba using the exponential decay formula
    exploration_proba = max(min_exploration_proba, np.exp(-exploration_decreasing_decay * e))
    rewards_per_episode.append(total_episode_reward)
Update:
Sharing the full error traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-74e6fe3c1212> in <module>()
25 # The environment runs the chosen action and returns
26 # the next state, a reward and true if the epiosed is ended.
---> 27 next_state, reward, done, _ = env.step(action)
28
29 #### #### #### ####
/Users/x/anaconda3/envs/y/lib/python3.6/site-packages/gym/wrappers/time_limit.py in step(self, action)
14 def step(self, action):
15 assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
---> 16 observation, reward, done, info = self.env.step(action)
17 self._elapsed_steps += 1
18 if self._elapsed_steps >= self._max_episode_steps:
/Users/x/anaconda3/envs/y/lib/python3.6/site-packages/gym_maze-0.4-py3.6.egg/gym_maze/envs/maze_env.py in step(self, action)
75 self.maze_view.move_robot(self.ACTION[action])
76 else:
---> 77 self.maze_view.move_robot(action)
78
79 if np.array_equal(self.maze_view.robot, self.maze_view.goal):
/Users/x/anaconda3/envs/y/lib/python3.6/site-packages/gym_maze-0.4-py3.6.egg/gym_maze/envs/maze_view_2d.py in move_robot(self, dir)
93 if dir not in self.__maze.COMPASS.keys():
94 raise ValueError("dir cannot be %s. The only valid dirs are %s."
---> 95 % (str(dir), str(self.__maze.COMPASS.keys())))
96
97 if self.__maze.is_open(self.__robot, dir):
ValueError: dir cannot be 1. The only valid dirs are dict_keys(['N', 'E', 'S', 'W']).
Second update:
Thanks to @Alexander L. Hayes for some debugging.
# we iterate over episodes
for e in range(n_episodes):
    # we initialize the first state of the episode
    current_state = env.reset()
    done = False

    # sum the rewards that the agent gets from the environment
    total_episode_reward = 0

    for i in range(max_iter_episode):
        current_coordinate_x = int(current_state[0])
        current_coordinate_y = int(current_state[1])
        current_Q_table_coordinates = states_dic[current_coordinate_x, current_coordinate_y]

        if np.random.uniform(0, 1) < exploration_proba:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q_table[current_Q_table_coordinates]))

        next_state, reward, done, _ = env.step(action)

        next_coordinate_x = int(next_state[0])
        next_coordinate_y = int(next_state[1])

        # update our Q-table using the Q-learning iteration
        next_Q_table_coordinates = states_dic[next_coordinate_x, next_coordinate_y]
        Q_table[current_Q_table_coordinates, action] = (1 - lr) * Q_table[current_Q_table_coordinates, action] + lr * (reward + gamma * max(Q_table[next_Q_table_coordinates,:]))

        total_episode_reward = total_episode_reward + reward
        # If the episode is finished, we leave the for loop
        if done:
            break
        current_state = next_state

    # We update the exploration proba using the exponential decay formula
    exploration_proba = max(min_exploration_proba, np.exp(-exploration_decreasing_decay * e))
    rewards_per_episode.append(total_episode_reward)
First guess (related to the answer, but not the answer itself):
In Gym environments (e.g. FrozenLake), discrete actions are usually encoded as integers.
It looks like the error is caused by this environment's non-standard way of representing actions. I've annotated the types I assumed when the action variable is set:
if np.random.uniform(0,1) < exploration_proba:
    # Is this a string?
    action = env.action_space.sample()
else:
    # np.argmax returns an int
    action = np.argmax(Q_table[current_state,:])
Replacing the else branch with something like this might work:
_action_map = {0: "N", 1: "E", 2: "S", 3: "W"}
action = _action_map[np.argmax(Q_table[current_state,:])]
Second guess (not even close, but fits the context):
It looks like this is working against the MattChanTK/gym-maze repository.
The MazeEnv.step() function does appear to handle string vs. integer representations of actions:
- the demo uses an approach similar to the code above, abstracted into select_action(state, explore_rate)
- rendering seems to use an alternative way of encoding actions? the COMPASS variable (see the sketch below)
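For context, COMPASS keys the robot's moves by the compass strings listed in the error message. A sketch of what it presumably looks like in maze_view_2d.py; only the "N"/"E"/"S"/"W" keys are confirmed by the ValueError, the (dx, dy) offsets here are an assumption:

# Assumed shape of gym_maze's COMPASS mapping; only the keys are
# confirmed by the error message, the (dx, dy) offsets are a guess.
COMPASS = {
    "N": (0, -1),
    "E": (1, 0),
    "S": (0, 1),
    "W": (-1, 0),
}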
Third guess (really close):
I've narrowed the problem down to how actions are selected from the Q-function. Here is a modified version with a breakpoint added:
for e in range(n_episodes):
    current_state = env.reset()
    done = False
    total_episode_reward = 0
    for i in range(max_iter_episode):
        if np.random.uniform(0, 1) < exploration_proba:
            action = env.action_space.sample()
        else:
            print("From Q_table:")
            action = np.argmax(Q_table[current_state,:])
            import pdb; pdb.set_trace()
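Inspecting action at that breakpoint is the telling part: np.argmax returns a NumPy integer, not a plain Python int, so an isinstance(action, int) check inside the environment will reject it. A minimal illustration outside the environment (a hypothetical probe, not output from the original session):

import numpy as np

q_row = np.zeros(4)
action = np.argmax(q_row)
print(type(action), isinstance(action, int))            # <class 'numpy.int64'> False (on a typical 64-bit build)
print(type(int(action)), isinstance(int(action), int))  # <class 'int'> True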
Solution (I can't take credit, @Penguin got it ☺️):
Convert current_state into Q-table coordinates, and cast the result of np.argmax to int:
for i in range(max_iter_episode):
    current_coordinate_x = int(current_state[0])
    current_coordinate_y = int(current_state[1])
    current_Q_table_coordinates = states_dic[current_coordinate_x, current_coordinate_y]

    if np.random.uniform(0, 1) < exploration_proba:
        action = env.action_space.sample()
    else:
        action = int(np.argmax(Q_table[current_Q_table_coordinates]))
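Why the int() cast matters: the traceback shows that MazeEnv.step() (maze_env.py, lines 75-77) branches on the action before moving the robot, roughly as sketched below. A NumPy integer fails an isinstance(action, int) style check, falls into the branch that expects a compass string, and triggers the ValueError. The guard condition itself is not visible in the traceback, so this is a paraphrase consistent with the observed behavior, not the verbatim gym-maze source:

# Paraphrase of the dispatch visible in the traceback (maze_env.py, lines 75-77);
# the isinstance guard is an assumption, not copied from the gym-maze source.
def step(self, action):
    if isinstance(action, int):
        # integer action: map it to a compass string first
        self.maze_view.move_robot(self.ACTION[action])
    else:
        # anything else is passed through and must already be "N"/"E"/"S"/"W";
        # a numpy.int64 ends up here and raises the ValueError
        self.maze_view.move_robot(action)
    # (reward / done / info handling follows in the real implementation)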