Adam optimizer error: one of the variables needed for gradient computation has been modified by an inplace operation

Question

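I am trying to implement an Actor-Critic learning automation algorithm. It is not the same as the basic actor-critic algorithm; it is slightly modified.

In any case, I am using the Adam optimizer, and the implementation is in PyTorch.

When I backpropagate the TD error for the Critic first, there is no error. However, when I then backpropagate the loss for the Actor, the following error occurs.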

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    in
         46 # update Actor Func
         47 optimizer_M.zero_grad()
    ---> 48 loss.backward()
         49 optimizer_M.step()
         50

    ~\Anaconda3\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
        100                 products. Defaults to False.
        101         """
    --> 102         torch.autograd.backward(self, gradient, retain_graph, create_graph)
        103
        104     def register_hook(self, hook):

    ~\Anaconda3\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
         88     Variable._execution_engine.run_backward(
         89         tensors, grad_tensors, retain_graph, create_graph,
    ---> 90         allow_unreachable=True)  # allow_unreachable flag
         91
         92

    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

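The above is the error message.

I tried to find the in-place operation, but I could not find one in the code I wrote. I think I don't know how to handle the optimizer.

Here is the main code: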

    for cur_step in range(1):
        action = M_Agent(state, flag)
        next_state, r = env.step(action)

        # calculate TD Error
        TD_error = M_Agent.cal_td_error(r, next_state)

        # calculate Target
        target = torch.FloatTensor([M_Agent.cal_target(TD_error)])
        logit = M_Agent.cal_logit()
        loss = criterion(logit, target)

        # update value Func
        optimizer_M.zero_grad()
        TD_error.backward()
        optimizer_M.step()

        # update Actor Func
        loss.backward()
        optimizer_M.step()

Here is the agent network:

    # Actor-Critic Agent
    self.act_pipe = nn.Sequential(nn.Linear(state, 128),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(128, 256),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(256, num_action),
                                  nn.Softmax()
                                  )

    self.val_pipe = nn.Sequential(nn.Linear(state, 128),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(128, 256),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(256, 1)
                                  )


    def forward(self, state, flag, test=None):

        temp_action_prob = self.act_pipe(state)
        self.action_prob = self.cal_prob(temp_action_prob, flag)
        self.action = self.get_action(self.action_prob)
        self.value = self.val_pipe(state)

        return self.action

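I want to update each network separately.

I would also like to know: does the basic TD Actor-Critic method use the TD error itself as the loss, or the squared error between r+V(s') and V(s)?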

Answer 1

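I think the problem is that you zero the gradients after the forward pass, right before calling backward. Note that automatic differentiation needs the computation graph and the intermediate results produced during the forward pass.

So zero the gradients before your TD error and target calculations, not after you have finished the forward pass.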

    for cur_step in range(1):
        action = M_Agent(state, flag)
        next_state, r = env.step(action)

        optimizer_M.zero_grad()  # zero your gradient here

        # calculate TD Error
        TD_error = M_Agent.cal_td_error(r, next_state)

        # calculate Target
        target = torch.FloatTensor([M_Agent.cal_target(TD_error)])
        logit = M_Agent.cal_logit()
        loss = criterion(logit, target)

        # update value Func
        TD_error.backward()
        optimizer_M.step()

        # update Actor Func
        loss.backward()
        optimizer_M.step()
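One related thing that may be worth checking, offered as a sketch rather than a diagnosis of the asker's code: `optimizer.step()` updates parameters in place, so in recent PyTorch versions calling it between two `backward()` calls can raise this same RuntimeError if the second backward pass still needs the old parameter values. The toy example below builds both losses, runs both backward passes, and only then steps. The `actor` and `critic` networks, the two separate optimizers, `gamma`, and the loss expressions are all hypothetical stand-ins, not the asker's `M_Agent`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the actor and critic pipes (not the asker's M_Agent)
actor = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
critic = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# Assumption: one optimizer per network, so each can be updated separately
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

state = torch.randn(1, 4)
next_state = torch.randn(1, 4)
reward, gamma = 1.0, 0.99  # assumed reward and discount factor

# Forward passes: build both graphs before any parameter is updated
value = critic(state)
with torch.no_grad():
    next_value = critic(next_state)  # bootstrapped target, treated as a constant
td_error = reward + gamma * next_value - value

dist = torch.distributions.Categorical(logits=actor(state))
action = dist.sample()
advantage = td_error.detach().squeeze(-1)  # no gradient flows into the critic here

critic_loss = td_error.pow(2).mean()
actor_loss = -(dist.log_prob(action) * advantage).mean()

# Zero gradients, run BOTH backward passes, then step
actor_opt.zero_grad()
critic_opt.zero_grad()
critic_loss.backward()
actor_loss.backward()
critic_opt.step()
actor_opt.step()
```

Because no parameter changes until both backward passes have finished, neither graph sees a tensor that was modified after the forward pass.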

To answer your second question: the DDPG algorithm, for example, uses the squared error (see the paper).
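As a rough illustration of the squared-error form, the sketch below computes a critic loss as the mean squared difference between the bootstrapped target r + γV(s') and V(s). The tiny `critic`, the discount `gamma`, and the random batch are assumptions made for the example; treating the target as a constant is a common convention, not something stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

critic = nn.Linear(3, 1)  # hypothetical tiny value function V(s)
gamma = 0.9               # assumed discount factor

# Random batch of 5 transitions (illustrative data only)
s = torch.randn(5, 3)
s_next = torch.randn(5, 3)
r = torch.randn(5, 1)

# Bootstrapped target r + gamma * V(s'), computed without tracking gradients
with torch.no_grad():
    target = r + gamma * critic(s_next)

v = critic(s)
td_error = target - v                # signed TD error
critic_loss = F.mse_loss(v, target)  # mean of the squared TD error

critic_loss.backward()
```

The raw TD error is signed, so minimizing it directly has no lower bound; squaring it gives a loss that is smallest when V(s) matches the target.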

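One more recommendation. In many deep actor-critic agents, the value and policy networks share most of their parameters: they use the same layers up to the last hidden layer, followed by a single linear output for the value prediction and a softmax layer for the action distribution. This is especially useful with high-dimensional visual input, since it acts as a form of multi-task learning, but you can still try it even though I see you have a low-dimensional state vector.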
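A minimal sketch of such a shared architecture might look like the following. The class name `SharedActorCritic`, the layer sizes, and the dimensions in the usage lines are hypothetical choices for illustration, not taken from the question.

```python
import torch
import torch.nn as nn


class SharedActorCritic(nn.Module):
    """Hypothetical module: shared trunk with separate policy and value heads."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        # Layers shared by the policy and the value function
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)             # scalar V(s)

    def forward(self, state):
        features = self.trunk(state)
        action_probs = torch.softmax(self.policy_head(features), dim=-1)
        value = self.value_head(features)
        return action_probs, value


# Usage with assumed dimensions: batch of 2 states, 4 features, 3 actions
model = SharedActorCritic(state_dim=4, num_actions=3)
probs, value = model(torch.randn(2, 4))
```

With a shared trunk, a single optimizer over `model.parameters()` and a combined loss (policy loss plus value loss) is the usual arrangement, since both heads send gradients into the same layers.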

error-handling

optimization

reinforcement-learning

deep-learning

pytorch