Why does a DQN for the cartpole game have an ascending reward while the loss is not descending?

I wrote a DQN to play the OpenAI Gym cartpole game using TensorFlow and tf_agents. The code looks like this:

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common

import config  # my own module holding the default hyperparameters


def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]


def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    buffer.add_batch(traj)


def collect_data(env, policy, buffer, steps):
    for _ in range(steps):
        collect_step(env, policy, buffer)


def train_model(
    num_iterations=config.default_num_iterations,
    collect_steps_per_iteration=config.default_collect_steps_per_iteration,
    replay_buffer_max_length=config.default_replay_buffer_max_length,
    batch_size=config.default_batch_size,
    learning_rate=config.default_learning_rate,
    log_interval=config.default_log_interval,
    num_eval_episodes=config.default_num_eval_episodes,
    eval_interval=config.default_eval_interval,
    checkpoint_saver_directory=config.default_checkpoint_saver_directory,
    model_saver_directory=config.default_model_saver_directory,
    visualize=False,
    static_plot=False,
):
    env_name = 'CartPole-v0'
    train_py_env = suite_gym.load(env_name)
    eval_py_env = suite_gym.load(env_name)
    train_env = tf_py_environment.TFPyEnvironment(train_py_env)
    eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
    fc_layer_params = (100,)
    q_net = q_network.QNetwork(
        train_env.observation_spec(),
        train_env.action_spec(),
        fc_layer_params=fc_layer_params)
    optimizer = Adam(learning_rate=learning_rate)
    train_step_counter = tf.Variable(0)
    agent = dqn_agent.DqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        q_network=q_net,
        optimizer=optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        train_step_counter=train_step_counter)
    agent.initialize()
    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.collect_data_spec,
        batch_size=train_env.batch_size,
        max_length=replay_buffer_max_length)
    dataset = replay_buffer.as_dataset(
        num_parallel_calls=3,
        sample_batch_size=batch_size,
        num_steps=2).prefetch(3)
    iterator = iter(dataset)
    agent.train_step_counter.assign(0)
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    returns = []
    loss = []
    for _ in range(num_iterations):
        for _ in range(collect_steps_per_iteration):
            collect_step(train_env, agent.collect_policy, replay_buffer)
        experience, unused_info = next(iterator)
        train_loss = agent.train(experience).loss
        loss.append(train_loss.numpy())
        step = agent.train_step_counter.numpy()
        avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
        returns.append(avg_return)

Although the average reward keeps improving and reaches 200, the maximum score, the loss never clearly decreases by the end of training.

Here is the loss plot:

And here is the reward plot:
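(The plot images did not survive here; they show the loss staying high while the average return climbs to 200. For reference, a minimal sketch of how such plots can be produced from the loss and returns lists collected above, assuming matplotlib is available:)

import matplotlib.pyplot as plt

# Hypothetical plotting helper: `loss` and `returns` are the lists filled in
# by the training loop above, one entry per training iteration.
fig, (ax_loss, ax_ret) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(loss)
ax_loss.set_xlabel('training iteration')
ax_loss.set_ylabel('TD loss')
ax_ret.plot(returns)
ax_ret.set_xlabel('training iteration')
ax_ret.set_ylabel('average return')
plt.show()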

The good news is that the model works, and watching it play is genuinely fun. But I would really like to understand why this happens: how can an extremely high loss still produce great returns?

It may be related to the scale of your Q-values. I saw the same behavior with my DQN's loss: my agent easily solves the environment, yet the loss keeps increasing over the course of training.

If you look at this part of the DQN algorithm, you may get some insight. The loss is the squared TD error (y - Q(s, a; θ))², where the target y = r + γ · max_a' Q(s', a'; θ⁻) is computed from the frozen target-network weights θ⁻:

  • First, you will notice that the target y is built from the max Q-value of the target network. As demonstrated in the Double-DQN paper, this can cause a systematic overestimation of the target Q-values. Since the target may be consistently overestimated while the prediction is not, there is always a gap between prediction and target (see the Double-DQN sketch at the end of this answer)
  • Second, as the Q-values grow, the scale of this delta grows with them. I think this is normal behavior: your Q-function learns that many states have a significant value, so the error at the start of training can be much smaller than the error at the end (a numerical sketch follows this list)
  • Third, the target Q-network is frozen for a number of steps while the prediction Q-network keeps changing, which also contributes to this delta
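To make the second point concrete, here is a tiny numerical sketch with made-up numbers (the fixed 10% gap between prediction and target is an assumption, not something measured on your run): with the same relative accuracy, the squared TD error grows roughly quadratically with the scale of the Q-values.

# Hypothetical numbers: the online network lags the bootstrapped target by a
# fixed 10% at every scale; only the magnitude of the Q-values changes.
gamma = 0.99
reward = 1.0                                # CartPole pays +1 per step
for q_scale in (1.0, 10.0, 100.0):          # Q magnitudes early vs. late in training
    target_q = q_scale                      # max_a' Q(s', a'; theta^-)
    predicted_q = 0.9 * q_scale             # online prediction, 10% behind
    td_target = reward + gamma * target_q   # y = r + gamma * max_a' Q_target
    td_error = td_target - predicted_q
    print(f"Q scale {q_scale:6.1f} -> TD error {td_error:6.2f}, "
          f"squared loss {td_error ** 2:9.2f}")

Even though the agent is getting better at every scale, the absolute squared loss reported during training goes up, not down.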

Hope this helps. Note that this is purely a personal, intuitive explanation; I have not run any tests to verify my hypotheses. I think the second point is probably the most important one.
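If you want to test the first point empirically, tf_agents also ships a Double-DQN variant which, as far as I know, takes the same constructor arguments as DqnAgent, so it should be a drop-in swap in your code. A minimal sketch, assuming the same q_net, optimizer, and specs as in your question:

from tf_agents.agents.dqn import dqn_agent

# Double DQN: the online network picks the argmax action and the target
# network evaluates it, which dampens the max-operator overestimation
# described in the first point above.
agent = dqn_agent.DdqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)
agent.initialize()

If the loss curve flattens noticeably with this agent while returns stay at 200, that would support the overestimation explanation.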