RLLib - Tensorflow - InvalidArgumentError: Received a label value of N which is outside the valid range of [0, N)

I am using RLLib's PPOTrainer with a custom environment. I call trainer.train() twice: the first call completes successfully, but the second one crashes with the error:

lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
(pid=15248)     raise type(e)(node_def, op, message)
(pid=15248) tensorflow.python.framework.errors_impl.InvalidArgumentError:
    Received a label value of 5 which is outside the valid range of [0, 5).  Label values: 5 5
(pid=15248)   [[node default_policy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at /tensorflow_core/python/framework/ops.py:1751) ]]

Here is my code:

main.py

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models import ModelCatalog

from MyEnv import MyEnv  # custom environment, defined in MyEnv.py below
# TreeObsPreprocessor is the project's custom preprocessor class

ModelCatalog.register_custom_preprocessor("tree_obs_prep", TreeObsPreprocessor)
ray.init()

trainer = PPOTrainer(env=MyEnv, config={
    "train_batch_size": 4000,
    "model": {
        "custom_preprocessor": "tree_obs_prep"
    }
})

for i in range(2):
    print(trainer.train())

MyEnv.py

import gym
import numpy as np
from ray import rllib


class MyEnv(rllib.env.MultiAgentEnv):
    def __init__(self, env_config):
        self.n_agents = 2

        self.env = *CREATES ENV*
        self.action_space = gym.spaces.Discrete(5)
        self.observation_space = np.zeros((1, 12))

    def reset(self):
        self.agents_done = []
        obs = self.env.reset()
        return obs[0]

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)

        d = dict()
        r = dict()
        o = dict()
        i = dict()
        for i_agent in range(len(self.env.agents)):
            if i_agent not in self.agents_done:
                o[i_agent] = obs[i_agent]
                r[i_agent] = rewards[i_agent]
                d[i_agent] = dones[i_agent]
                i[i_agent] = infos[i_agent]
        d['__all__'] = dones['__all__']

        for agent, done in dones.items():
            if done and agent != '__all__':
                self.agents_done.append(agent)

        return o, r, d, i

I have no idea where the problem is. Any suggestions? What does this error mean?

This comment was very helpful for me:

FWIW, I think such issues can happen if NaNs appear in the policy output. When that happens, you can get out of range errors.

Usually it's due to the observation or reward somehow becoming NaN, though it could be the policy diverging as well.
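To see what the message itself means: the "label" fed to SparseSoftmaxCrossEntropyWithLogits is the sampled action index, and it must lie in [0, num_actions). A NaN in the sampling path can come out as exactly such an out-of-range index. Below is a minimal sketch (plain TensorFlow, not RLLib code) that triggers the same message; with eager execution on CPU the op raises this InvalidArgumentError, while on GPU it may instead silently produce NaN losses:

import tensorflow as tf

# 5 logits, so the only valid labels (action indices) are 0..4
logits = tf.constant([[0.1, 0.2, 0.3, 0.2, 0.2]])
labels = tf.constant([5])  # out of range, like the "label value of 5" in the error

# InvalidArgumentError: Received a label value of 5 which is outside the valid range of [0, 5).
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)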

In my case, I had to modify my observations, because the agent was not able to learn a policy and, at some point during training (at a random timestep), the actions it returned were NaN.
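As a rough sketch of the kind of guard that catches this early (assuming the per-agent observations are plain numeric arrays; check_finite is a hypothetical helper, not part of RLLib), you can call it right after self.env.step(action_dict) in MyEnv.step:

import numpy as np

def check_finite(obs, rewards):
    # Hypothetical helper: fail fast if any observation or reward is NaN/inf,
    # instead of letting it reach the policy and corrupt the sampled action.
    for agent_id, agent_obs in obs.items():
        if not np.all(np.isfinite(np.asarray(agent_obs, dtype=np.float64))):
            raise ValueError(f"Non-finite observation for agent {agent_id}: {agent_obs}")
    for agent_id, reward in rewards.items():
        if agent_id == '__all__':
            continue
        if not np.isfinite(reward):
            raise ValueError(f"Non-finite reward for agent {agent_id}: {reward}")

# usage inside MyEnv.step():
#     obs, rewards, dones, infos = self.env.step(action_dict)
#     check_finite(obs, rewards)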