How does one implement action masking?

The Actor-Mimic paper describes an action masking procedure. To quote:

While playing a certain game, we mask out AMN action outputs that are not valid for that game and take the softmax over only the subset of valid actions

Does anyone know how to implement this action masking in TensorFlow? Specifically, how do I take the softmax over only a specified subset of actions?

Suppose you have a validity tensor containing 1s and 0s, where 1 marks a valid action.

is_valid = [1, 0, 1, ...]

Then you have an actions tensor, and you want to take the softmax over only the entries that are valid. You can do the following.

(tf.exp(actions) * is_valid) / (tf.reduce_sum(tf.exp(actions) * is_valid) + epsilon)

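Here is_valid zeroes out the invalid entries, both in the numerator and in the sum. I would also add a small epsilon to the denominator so that you can never divide by zero, for example when no action is valid.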
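To check the formula by hand, here is a minimal plain-Python transcription of it. This is only a sketch: the function name masked_softmax and the default epsilon of 1e-8 are assumptions made for illustration, not part of the answer.

```python
import math

def masked_softmax(actions, is_valid, epsilon=1e-8):
    """Softmax over the entries of `actions` whose `is_valid` flag is 1."""
    # exp of each score, multiplied by 0 where the action is invalid
    weighted = [math.exp(a) * v for a, v in zip(actions, is_valid)]
    # epsilon (assumed value) keeps the denominator non-zero
    total = sum(weighted) + epsilon
    return [w / total for w in weighted]

# Three actions, the middle one invalid for this game.
probs = masked_softmax([1.0, 2.0, 1.0], [1, 0, 1])
print(probs)
```

The invalid action gets probability exactly 0, and the two valid actions, having equal scores, each get roughly 0.5.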

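You should rely on the built-in softmax function. That is, first mask out the invalid actions in the tensor with boolean_mask, and then apply the softmax function.

(The solution provided by chasep255 has numerical issues.)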
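A minimal sketch of this approach, assuming TensorFlow 2.x with eager execution and 1-D tensors. The function names are hypothetical. The second function is a shape-preserving variant that is not from the answer: it replaces invalid logits with a large negative constant (the value -1e9 is an assumption) instead of dropping them. The answer does not say what the numerical issue is; one likely concern is that exponentiating large scores directly overflows, which is why the example uses scores of 1000.

```python
import tensorflow as tf

def softmax_over_valid(logits, is_valid):
    """Drop invalid actions, then apply the built-in softmax.

    The result has one entry per valid action, so the original
    action indices are lost.
    """
    mask = tf.cast(is_valid, tf.bool)
    valid_logits = tf.boolean_mask(logits, mask)
    return tf.nn.softmax(valid_logits)

def softmax_keep_shape(logits, is_valid, big_negative=-1e9):
    """Variant (an assumption, not from the answer) that keeps the shape.

    Invalid logits are replaced by a large negative constant, so their
    probability underflows to 0 after the softmax.
    """
    mask = tf.cast(is_valid, tf.bool)
    masked_logits = tf.where(mask, logits, tf.ones_like(logits) * big_negative)
    return tf.nn.softmax(masked_logits)

# Large scores: exponentiating these directly would overflow in float32.
logits = tf.constant([1000.0, 3.0, 1000.0])
is_valid = tf.constant([1, 0, 1])

probs_valid = softmax_over_valid(logits, is_valid)  # shape (2,)
probs_full = softmax_keep_shape(logits, is_valid)   # shape (3,)
```

The first form matches the answer as written. The second may be more convenient when the probabilities have to line up with the full action set, for example when sampling an action index.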

reinforcement-learning

multitasking

tensorflow