Why does the DQN for the cartpole game have an ascending reward while the loss is not descending?
I wrote a DQN to play the OpenAI Gym cartpole game with TensorFlow and tf_agents. The code looks like this:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common

import config  # my own module holding the default hyperparameters


def compute_avg_return(environment, policy, num_episodes=10):
    # Evaluation helper: average undiscounted return of `policy` over a few episodes.
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]


def collect_step(environment, policy, buffer):
    # Take one step with `policy` and store the transition in the replay buffer.
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    buffer.add_batch(traj)


def collect_data(env, policy, buffer, steps):
    for _ in range(steps):
        collect_step(env, policy, buffer)


def train_model(
        num_iterations=config.default_num_iterations,
        collect_steps_per_iteration=config.default_collect_steps_per_iteration,
        replay_buffer_max_length=config.default_replay_buffer_max_length,
        batch_size=config.default_batch_size,
        learning_rate=config.default_learning_rate,
        log_interval=config.default_log_interval,
        num_eval_episodes=config.default_num_eval_episodes,
        eval_interval=config.default_eval_interval,
        checkpoint_saver_directory=config.default_checkpoint_saver_directory,
        model_saver_directory=config.default_model_saver_directory,
        visualize=False,
        static_plot=False,
):
    env_name = 'CartPole-v0'
    train_py_env = suite_gym.load(env_name)
    eval_py_env = suite_gym.load(env_name)
    train_env = tf_py_environment.TFPyEnvironment(train_py_env)
    eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

    # Q-network with a single fully connected hidden layer of 100 units.
    fc_layer_params = (100,)
    q_net = q_network.QNetwork(
        train_env.observation_spec(),
        train_env.action_spec(),
        fc_layer_params=fc_layer_params)

    optimizer = Adam(learning_rate=learning_rate)
    train_step_counter = tf.Variable(0)
    agent = dqn_agent.DqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        q_network=q_net,
        optimizer=optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)
    agent.initialize()

    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.collect_data_spec,
        batch_size=train_env.batch_size,
        max_length=replay_buffer_max_length)
    dataset = replay_buffer.as_dataset(
        num_parallel_calls=3,
        sample_batch_size=batch_size,
        num_steps=2).prefetch(3)
    iterator = iter(dataset)

    agent.train_step_counter.assign(0)
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    returns = []
    loss = []
    for _ in range(num_iterations):
        # Collect a few steps with the collect (exploration) policy, then train once.
        for _ in range(collect_steps_per_iteration):
            collect_step(train_env, agent.collect_policy, replay_buffer)
        experience, unused_info = next(iterator)
        train_loss = agent.train(experience).loss
        step = agent.train_step_counter.numpy()
        avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
        returns.append(avg_return)
        loss.append(train_loss)  # track the loss that is plotted below
Although the average reward keeps improving and eventually reaches 200, the maximum score, the loss never shows a clear downward trend.
Here is the loss plot:
And here is the reward plot:
The good news is that the model works, and it is really satisfying to watch it play. Still, I would like to understand in depth why this happens: how can an extremely high loss still produce such a good return?
It probably has to do with the scale of your Q-values. I saw the same behavior with my DQN: the agent solved the environment easily, yet the loss kept increasing over training.
You may get some insight if you look at this part of the DQN algorithm:
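The part being referred to is presumably the TD-target and loss computation. As a minimal numeric sketch (the discount, reward and Q-values below are made up for illustration, nothing here comes from the post):

import numpy as np

# Hypothetical transition: CartPole gives a reward of +1 per step.
gamma = 0.99
reward = 1.0
q_next_target = np.array([24.7, 25.3])  # target network's Q(s', a') for both actions
q_pred = 24.0                           # online network's Q(s, a) for the taken action

# The target y is built on the max Q-value of the target network, and the loss is the
# squared TD error (element_wise_squared_loss in the question's code).
y = reward + gamma * q_next_target.max()
loss = (y - q_pred) ** 2
print(y, loss)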
- First, notice that the target y is built on the max Q-value of the target network. As demonstrated in the Double-DQN paper, this can lead to a consistent overestimation of the target Q-values. Since the target may be constantly overestimated while the prediction is not, there will always be a gap between prediction and target (one way to check this with TF-Agents is sketched right after this list).
- Second, the scale of this gap grows as the Q-values themselves grow. I think this is normal behavior: as your Q-function learns that many states have a large value, the error at the beginning of training can be much smaller than the error at the end.
- Third, the target Q-network is frozen for a number of steps while the prediction Q-network keeps changing, which also contributes to this gap.
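To see how much the overestimation from the first point contributes, TF-Agents ships a Double-DQN variant of the agent. As far as I know, DdqnAgent is a drop-in replacement for DqnAgent with the same constructor arguments, so a hedged sketch against the question's code would look like this (train_env, q_net, optimizer, common and train_step_counter are the objects already defined there):

from tf_agents.agents.dqn import dqn_agent

# Sketch only: DdqnAgent picks the argmax action with the online network and
# evaluates it with the target network, which reduces the overestimation bias
# described in the first point above.
agent = dqn_agent.DdqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)
agent.initialize()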
I hope this helps. Note that this is purely a personal, intuitive explanation; I did not run any tests to verify these hypotheses. I think the second point is probably the most important one.
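One cheap way to test the Q-value-scale hypothesis (my addition, not something proposed in the answer) is to log the average predicted Q-value next to the loss during training; if the loss grows roughly with the square of that value, the second point is the likely explanation. A sketch using the q_net, eval_env and tf already defined in the question:

# QNetwork.__call__ returns (q_values, network_state); average over the batch and actions.
time_step = eval_env.reset()
q_values, _ = q_net(time_step.observation)
print('mean predicted Q-value:', float(tf.reduce_mean(q_values)))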