How to prevent my reward sum received during evaluation runs from repeating in intervals when using RLlib?

I am using Ray 1.3.0 (for RLlib) in combination with SUMO version 1.9.2 to simulate a multi-agent scenario. I have configured RLlib to use a single PPO network that is commonly updated/used by all N agents. My evaluation settings look like this:

# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.

"evaluation_interval": 20,

# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.

"evaluation_num_episodes": 10,

# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.

"evaluation_parallel_to_training": False,

# Internal flag that is set to True for evaluation workers.

"in_evaluation": True,

# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!

"evaluation_config": {
    # Example: overriding env_config, exploration, etc:
    "lr": 0, # To prevent any kind of learning during evaluation
    "explore": True # As required by PPO (read IMPORTANT NOTE above)
},

# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).

"evaluation_num_workers": 1,

# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.

"custom_eval_function": None,

What happens is that every 20 iterations (each iteration collecting "X" training samples), there is an evaluation run of a minimum of 10 episodes. The sum of the rewards received by all N agents is added up over these episodes and taken as the reward sum for that particular evaluation run. Over time, I noticed a pattern in which the reward sums keep repeating over the same evaluation interval runs, and the learning goes nowhere.
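
To be clear about which number I am tracking: it is the value RLlib reports under the "evaluation" key of the training result (see the config comments above), read out roughly like this:

result = trainer.train()
if "evaluation" in result:
    # Total reward of all agents per episode, averaged over the evaluation episodes.
    print(result["training_iteration"], result["evaluation"]["episode_reward_mean"])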

Update (23/06/2021)

Unfortunately, I did not have TensorBoard activated for that particular run, but from the mean rewards collected during the evaluations of 10 episodes each (which take place every 20 iterations), it is clear that there is a repeating pattern, as shown in the annotated plot below:

The 20 agents in the scenario are supposed to be learning to avoid collisions, yet they somehow keep stagnating at a certain policy and end up showing the exact same reward sequences during evaluation?

Is this characteristic of how I have configured the evaluation, or is there something else I should be checking? I would be grateful if anyone could advise me or point me in the right direction.

Thank you.

Could it be that, due to the multi-agent dynamics, your policies are chasing each other's tails? How many policies do you have? Are they competing/collaborating/neutral towards each other? Note that multi-agent training can be very unstable, and seeing these fluctuations is quite normal, as the different policies get updated and then have to face different "env" dynamics because of that (env = env + all the other policies, which also appear as part of the environment).

Step 1: I noticed that when, for whatever reason, I stopped a run at some point and then restarted it from a saved checkpoint, most of the lines on TensorBoard (including the rewards) were drawn in exactly the same way again after restoring, which made it look like a repeating sequence.
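
The restore itself was nothing special, roughly the standard pattern (the checkpoint path here is a placeholder):

trainer = PPOTrainer(config=config)
trainer.restore("/path/to/checkpoint_000400/checkpoint-400")  # placeholder path
trainer.train()  # continue training from the restored state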

Step 2: This led me to believe that something was wrong with my checkpoints. I compared the weights stored in the checkpoints using a loop and, lo and behold, they were all identical! Nothing had changed! So either the saving/restoring of checkpoints was broken, which after some experimenting I found was not the case, or my weights were simply never being updated.
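
The comparison was a simple loop along these lines (the checkpoint paths and the policy ID are placeholders for my actual ones):

import numpy as np

checkpoint_paths = ["/path/to/checkpoint-20", "/path/to/checkpoint-40"]  # placeholders
previous = None
for path in checkpoint_paths:
    trainer.restore(path)
    weights = trainer.get_policy("shared_policy").get_weights()  # placeholder policy ID
    if previous is not None:
        unchanged = all(np.array_equal(weights[k], previous[k]) for k in weights)
        print(path, "weights unchanged since previous checkpoint:", unchanged)
    previous = weights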

Step 3: I sifted through my training configuration to see whether anything there was preventing the network from learning, and I noticed that I had set the "multiagent" config option "policies_to_train" to a policy that did not exist. Unfortunately, either no warning/error was thrown, or one was thrown and I completely missed it.

Solution step: After setting the multiagent "policies_to_train" config option correctly, it started working!
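
In config terms, the mistake and the fix looked roughly like this (the policy IDs and the observation/action spaces are placeholders):

# Broken: "policies_to_train" names a policy ID that is not defined under
# "policies", so no policy ever gets updated.
"multiagent": {
    "policies": {
        "shared_policy": (None, obs_space, act_space, {}),
    },
    "policy_mapping_fn": lambda agent_id: "shared_policy",
    "policies_to_train": ["some_nonexistent_policy"],
},

# Fixed: train the policy that actually exists.
"multiagent": {
    "policies": {
        "shared_policy": (None, obs_space, act_space, {}),
    },
    "policy_mapping_fn": lambda agent_id: "shared_policy",
    "policies_to_train": ["shared_policy"],
},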