How can I change this to use a q table for reinforcement learning
I am learning about q-tables, and ran through a simple version that only used a one-dimensional array to move forward and backward. Now I am trying four-direction movement and got stuck on controlling the agent.
I have the random movement working now, and it will eventually find the goal. But I want it to learn how to get to the goal instead of randomly stumbling onto it. So I would appreciate any advice on adding q-learning to this code. Thank you.
Here is my full code, as it is pretty simple right now.
import numpy as np
import random
import math

world = np.zeros((5, 5))
print(world)

# Make sure the goal can never be (0, 0), i.e. the start point
goal_x = random.randint(1, 4)
goal_y = random.randint(1, 4)
goal = (goal_x, goal_y)
print(goal)

world[goal] = 1
print(world)

LEFT = 0
RIGHT = 1
UP = 2
DOWN = 3
map_range_min = 0
map_range_max = 5

class Agent:
    def __init__(self, current_position, my_goal, world):
        self.current_position = current_position
        self.last_position = current_position
        self.visited_positions = []
        self.goal = my_goal
        self.last_reward = 0
        self.totalReward = 0
        self.q_table = world

    # Update the total reward by the reward
    def updateReward(self, extra_reward):
        # This will either increase or decrease the total reward for the episode
        x = (self.goal[0] - self.current_position[0]) ** 2
        y = (self.goal[1] - self.current_position[1]) ** 2
        dist = math.sqrt(x + y)
        complete_reward = dist + extra_reward
        self.totalReward += complete_reward

    def validate_move(self):
        valid_move_set = []
        # Check the x range (valid positions are 0 .. map_range_max - 1)
        if map_range_min < self.current_position[0] < map_range_max - 1:
            valid_move_set.append(LEFT)
            valid_move_set.append(RIGHT)
        elif map_range_min == self.current_position[0]:
            valid_move_set.append(RIGHT)
        else:
            valid_move_set.append(LEFT)
        # Check the y range
        if map_range_min < self.current_position[1] < map_range_max - 1:
            valid_move_set.append(UP)
            valid_move_set.append(DOWN)
        elif map_range_min == self.current_position[1]:
            valid_move_set.append(DOWN)
        else:
            valid_move_set.append(UP)
        return valid_move_set

    # Make the agent move
    def move_right(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        x += 1
        y = self.current_position[1]
        return (x, y)

    def move_left(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        x -= 1
        y = self.current_position[1]
        return (x, y)

    def move_down(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y += 1
        return (x, y)

    def move_up(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y -= 1
        return (x, y)

    def move_agent(self):
        move_set = self.validate_move()
        randChoice = random.randint(0, len(move_set) - 1)
        move = move_set[randChoice]
        if move == UP:
            return self.move_up()
        elif move == DOWN:
            return self.move_down()
        elif move == RIGHT:
            return self.move_right()
        else:
            return self.move_left()

    # Update the rewards
    # Return False to end the episode
    def checkPosition(self):
        if self.current_position == self.goal:
            print("Found Goal")
            self.updateReward(10)
            return False
        else:
            # Choose a new direction
            self.current_position = self.move_agent()
            self.visited_positions.append(self.current_position)
            # Currently get nothing for not reaching the goal
            self.updateReward(0)
            return True

gus = Agent((0, 0), goal, world)
play = gus.checkPosition()
while play:
    play = gus.checkPosition()
print(gus.totalReward)
I have a few suggestions based on your code example:
Separate the environment from the agent. The environment needs a method of the form new_state, reward = env.step(old_state, action). This method describes how an action transforms your old state into a new state. It is a good idea to encode your states and actions as simple integers. I strongly recommend setting up unit tests for this method.
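As an illustration only (the class name GridEnv and the reward values are my own choices, not something from your code), a minimal environment for your 5x5 world could look like this, with states encoded as integers 0..24 and actions as integers 0..3:

```python
class GridEnv:
    """A minimal 5x5 gridworld. States are integers 0..24 (x * 5 + y),
    actions are integers 0..3 (LEFT, RIGHT, UP, DOWN)."""
    LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3
    SIZE = 5

    def __init__(self, goal):
        self.goal = goal  # (x, y) tuple, as in your code

    def step(self, old_state, action):
        """Apply an action to a state; return (new_state, reward).
        Moves that would leave the grid leave the state unchanged."""
        x, y = old_state // self.SIZE, old_state % self.SIZE
        if action == self.LEFT and x > 0:
            x -= 1
        elif action == self.RIGHT and x < self.SIZE - 1:
            x += 1
        elif action == self.UP and y > 0:
            y -= 1
        elif action == self.DOWN and y < self.SIZE - 1:
            y += 1
        new_state = x * self.SIZE + y
        # -1 per step encourages short paths; +10 for reaching the goal
        reward = 10 if (x, y) == self.goal else -1
        return new_state, reward
```

A unit test for this method is then a plain assertion on individual transitions, e.g. that moving RIGHT from the corner state 0 lands in state 5, or that moving LEFT at the edge leaves the state unchanged.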
The agent then needs an equivalent method action = agent.policy(state, reward). As a first step, you should manually code an agent that does what you consider right. For example, it might simply try to head toward the goal position.
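Such a hand-coded baseline might look like the sketch below (GreedyAgent is a made-up name, and it assumes the same integer state encoding as above; it does no learning at all, it just walks toward the goal):

```python
class GreedyAgent:
    """Hand-coded baseline policy: always step toward the goal."""
    LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3

    def __init__(self, goal, size=5):
        self.goal = goal  # (x, y) tuple
        self.size = size

    def policy(self, state, reward):
        # Decode the integer state back into (x, y) coordinates
        x, y = state // self.size, state % self.size
        gx, gy = self.goal
        # Close the x gap first, then the y gap
        if x < gx:
            return self.RIGHT
        if x > gx:
            return self.LEFT
        if y < gy:
            return self.DOWN
        return self.UP
```

Writing this first is useful because it exercises the same env.step / agent.policy interface that a learning agent will later plug into.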
Consider the question of whether the state representation is Markovian. If you could do better at the problem by having a memory of all the past states you visited, then the state doesn't have the Markov property. Preferably, the state representation should be compact (the smallest set that is still Markovian).
Once this structure is set up, you can think about actually learning a Q-table. One possible method (easy to understand, but not necessarily that efficient) is Monte Carlo with either exploring starts or epsilon-soft greedy. A good RL book should give pseudocode for either variant.
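To make the Q-table idea concrete, here is a sketch of tabular Q-learning with an epsilon-greedy policy (a temporal-difference alternative to the Monte Carlo variants just mentioned). The step function, grid size, fixed goal, and hyperparameter values are all my own assumptions for the sketch, chosen to match the 5x5 world:

```python
import random

SIZE, GOAL = 5, (2, 2)  # goal fixed here just for the sketch

def step(state, action):
    """Minimal 5x5 gridworld transition: states 0..24, actions 0..3."""
    x, y = state // SIZE, state % SIZE
    if action == 0 and x > 0:           x -= 1  # LEFT
    elif action == 1 and x < SIZE - 1:  x += 1  # RIGHT
    elif action == 2 and y > 0:         y -= 1  # UP
    elif action == 3 and y < SIZE - 1:  y += 1  # DOWN
    reward = 10 if (x, y) == GOAL else -1
    return x * SIZE + y, reward

def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1):
    # One row per state, one column per action, initialised to zero
    q = [[0.0] * 4 for _ in range(SIZE * SIZE)]
    for _ in range(episodes):
        state = 0  # always start in the corner
        while (state // SIZE, state % SIZE) != GOAL:
            # epsilon-greedy: explore with probability epsilon, else exploit
            if random.random() < epsilon:
                action = random.randrange(4)
            else:
                action = max(range(4), key=lambda a: q[state][a])
            new_state, reward = step(state, action)
            # Q-learning update: bootstrap from the best next-state value
            q[state][action] += alpha * (
                reward + gamma * max(q[new_state]) - q[state][action])
            state = new_state
    return q
```

After training, acting greedily with respect to the table (max(range(4), key=lambda a: q[s][a])) should trace a short path from the start to the goal, which is exactly the "learn instead of stumble" behavior asked about in the question.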
When you are feeling confident, head to OpenAI Gym (https://gym.openai.com/) for some more detailed class structures. There are some hints about creating your own environments here: https://gym.openai.com/docs/#environments