Structuring a reward function for an OpenAI RL environment for raw material purchasing
I am experimenting with deep reinforcement learning and have created the following environment, which runs a simulation of purchasing raw material. The start quantity is the amount of material I have on hand going into the next 12 weeks (sim_weeks) of purchasing. I must buy in multiples of 195,000 lbs and am expected to use 45,000 lbs of material per week.
import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

start_qty = 100000
sim_weeks = 12
purchase_mult = 195000
# days on hand cost =
forecast_qty = 45000

class ResinEnv(Env):
    def __init__(self):
        # Actions we can take: buy 0, buy 1x
        self.action_space = Discrete(2)
        # Purchase array space
        self.observation_space = Box(low=np.array([-1000000]), high=np.array([1000000]))
        # Set start qty
        self.state = start_qty
        # Set purchase length
        self.purchase_length = sim_weeks
        # self.current_step = 1

    def step(self, action):
        # Apply action:
        # this gives us qty_available at the end of the week
        self.state -= forecast_qty
        # See if we need to buy
        self.state += (action * purchase_mult)
        # Now calculate the days on hand from this:
        days = self.state / forecast_qty / 7
        # Reduce weeks left to purchase by 1 week
        self.purchase_length -= 1
        # self.current_step += 1
        # Calculate reward: reward is the negative of days_on_hand
        if self.state < 0:
            reward = -10000
        else:
            reward = -days
        # Check if the simulation is done
        if self.purchase_length <= 0:
            done = True
        else:
            done = False
        # Set placeholder for info
        info = {}
        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Implement viz
        pass

    def reset(self):
        # Reset qty
        self.state = start_qty
        self.purchase_length = sim_weeks
        return self.state
I am debating whether this reward function is adequate. What I am trying to do is minimize the sum of days on hand across all steps, where days on hand for a given step is the days value in the code. Since the goal is to maximize the reward function, I decided I could negate the days-on-hand value and use that negative number as the reward (so maximizing the reward minimizes days on hand). I then added a harsh penalty for letting the quantity available go negative in any given week.
Is there a better way to do this? I am new to this topic, and fairly new to Python in general. Any input is greatly appreciated!
I think you should consider scaling down the rewards. Check here and here on stabilizing training in neural networks. If the RL agent's only task is to minimize the sum of days on hand, then the reward scheme makes sense. It just needs a little normalization!
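As a sketch of what that normalization could look like (my own illustration using the variable names from the question, not code from either post), you can divide the per-step reward by an upper bound on the days value so it stays roughly in [-1, 0], and put the stock-out penalty on a comparable scale:

# Upper bound on the "days" value the environment can produce, using the
# same definition of days as in step() (buying every single week).
MAX_DAYS = (start_qty + sim_weeks * purchase_mult) / forecast_qty / 7
# Hypothetical penalty constant, chosen to sit near the normalized reward
# range instead of dwarfing it; tune as needed.
STOCKOUT_PENALTY = -10.0

def scaled_reward(qty_on_hand):
    # Same penalty rule as the original, just on a normalized scale
    if qty_on_hand < 0:
        return STOCKOUT_PENALTY
    days = qty_on_hand / forecast_qty / 7   # days on hand as defined in the question
    return -days / MAX_DAYS                 # lies in [-1, 0]; maximizing it still minimizes days on hand

With something like this, a stock-out is still clearly the worst outcome, but the two reward regimes differ by roughly one order of magnitude instead of three, which tends to be easier on the value network.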