What Loss Or Reward Is Backpropagated In Policy Gradients For Reinforcement Learning?

I made a small script in Python to solve various Gym environments with policy gradients.

import gym, os
import numpy as np
#create environment
env = gym.make('CartPole-v0')
env.reset()
s_size = len(env.reset())
a_size = 2

#import my neural network code
os.chdir(r'C:\---\---\---\Python Code')
import RLPolicy
policy = RLPolicy.NeuralNetwork([s_size,a_size],learning_rate=0.000001,activations=['softmax']) #a 3-layer network might be ([s_size, 5, a_size],learning_rate=1,activations=['tanh','softmax'])
#it supports the sigmoid activation function also
print(policy.weights)

DISCOUNT = 0.95 #parameter for discounting future rewards

#first step
state = env.reset()
probs = policy.feedforward(state)[-1] #the last layer's outputs are the action probabilities
action = np.random.choice(a_size,p=probs)
state,reward,done,info = env.step(action)

for t in range(3000):
    done = False
    states = [] #lists for recording episode
    probs2 = []
    rewards = []
    while not done:
        #env.render() #to visualize learning

        probs = policy.feedforward(state)[-1] #calculate probabilities of actions
        action = np.random.choice(a_size,p=probs) #choose action from probs

        #record and update state
        probs2.append(probs) 
        states.append(state)
        state,reward,done,info = env.step(action)
        rewards.append(reward) #should reward be before updating state?

    #calculate gradients
    gradients_w = []
    gradients_b = []
    for i in range(len(rewards)):
        totalReward = sum([rewards[i+k]*DISCOUNT**k for k in range(len(rewards)-i)]) #discounted return from step i onward
        ## !! this is the line that I need help with
        gradient = policy.backpropagation(states[i],totalReward*(probs2[i])) #what should be backpropagated through the network
        ## !!

        ##record gradients
        gradients_w.append(gradient[0])
        gradients_b.append(gradient[1])
    #combine gradients and update the weights and biases
    gradients_w = np.array(gradients_w,object)
    gradients_b = np.array(gradients_b,object)
    policy.weights += policy.learning_rate * np.flip(np.sum(gradients_w,0),0) #np.flip because the gradients are calculated backwards
    policy.biases += policy.learning_rate * np.flip(np.sum(gradients_b,0),0)
    #reset and record
    state = env.reset() #start the next episode from a fresh state
    if t%100==0:
        print('t'+str(t),'r',sum(rewards))

What should be passed backwards to compute the gradients? I am using gradient ascent, but I could switch it to descent. Some people define the objective as totalReward*log(probabilities). Would that make the score derivative totalReward*(1/probs), or log(probs), or something else? Do you use a cost function like cross-entropy? I have tried
totalReward*np.log(probs)
totalReward*(1/probs)
totalReward*(probs**2)
totalReward*probs

probs = np.zeros(a_size)  
probs[action] = 1  
totalReward*probs

and several others. The last one is the only one that solves anything at all, and it only works on CartPole. I have tested the various loss or scoring functions with gradient ascent and descent over thousands of episodes on CartPole, Pendulum, and MountainCar. Sometimes it improves a little, but it never solves the environment. What am I doing wrong?

Here is the RLPolicy code. It is poorly written or pseudocode-like, but I don't think that is the problem, because I checked it several times with gradient checking. Even so, it would help if I could at least narrow the problem down to the neural network or to somewhere else in my code.

#Neural Network
import numpy as np
import random, math, time, os
from matplotlib import pyplot as plt

def activation(x,function):
    if function=='sigmoid':
        return(1/(1+math.e**(-x))) #Sigmoid
    if function=='relu':
        x[x<0]=0
        return(x)
    if function=='tanh':
        return(np.tanh(x.astype(float))) #tanh
    if function=='softmax':
        z = np.exp(np.array((x-max(x)),float))
        y = np.sum(z)
        return(z/y)
def activationDerivative(x,function):
    if function=='sigmoid':
        return(x*(1-x))
    if function=='relu':
        x[x<=0]=0 #assignment, not '==' comparison (the original comparisons had no effect)
        x[x>0]=1
        return(x)
    if function=='tanh':
        return(1-x**2)
    if function=='softmax':
        s = x.reshape(-1,1)
        return(np.diagflat(s) - np.dot(s, s.T))

class NeuralNetwork():
    
    def __init__ (self,layers,learning_rate,momentum=0,regularization=0,activations=None): #momentum and regularization are accepted but unused
        self.learning_rate = learning_rate   
        if (isinstance(layers[1],list)):
            h = layers[1][:]
            del layers[1]
            for i in h:
                layers.insert(-1,i)
        self.layers = layers
        self.weights = [2*np.random.rand(self.layers[i]*self.layers[i+1])-1 for i in range(len(self.layers)-1)]
        self.biases = [2*np.random.rand(self.layers[i+1])-1 for i in range(len(self.layers)-1)]    
        self.weights = np.array(self.weights,object)
        self.biases = np.array(self.biases,object)
        self.activations = activations
    def feedforward(self, input_array):
        layer = input_array
        neuron_outputs = [layer]
        for i in range(len(self.layers)-1):
            layer = np.tile(layer,self.layers[i+1])
            layer = np.reshape(layer,[self.layers[i+1],self.layers[i]])
            weights = np.reshape(self.weights[i],[self.layers[i+1],self.layers[i]])
            layer = weights*layer
            layer = np.sum(layer,1)#,self.layers[i+1]-1)
            layer = layer+self.biases[i]
            layer = activation(layer,self.activations[i])
            neuron_outputs.append(np.array(layer,float))
        return(neuron_outputs)
    def neuronErrors(self,l,neurons,layerError,n_os):
        if (l==len(self.layers)-2):
            return(layerError)
        totalErr = [] #total error
        for e in range(len(layerError)): #-layers
            e = e*self.layers[l+2]
            a_ws = self.weights[l+1][e:e+self.layers[l+1]]
            e = int(e/self.layers[l+2])
            err = layerError[e]*a_ws #error
            totalErr.append(err)
        return(sum(totalErr))
    def backpropagation(self,state,loss):
        weights_gradient = [np.zeros(self.layers[i]*self.layers[i+1]) for i in range(len(self.layers)-1)]
        biases_gradient = [np.zeros(self.layers[i+1]) for i in range(len(self.layers)-1)]  
        neuron_outputs = self.feedforward(state)
        grad = self.individualBackpropagation(loss, neuron_outputs)
        return(grad)

    def individualBackpropagation(self, difference, neuron_outputs): #number of output
        lr = self.learning_rate
        n_os = neuron_outputs[:]
        w_o = self.weights[:]
        b_o = self.biases[:]
        w_n = self.weights[:]
        b_n = self.biases[:]
        gradient_w = []
        gradient_b = []
        error = difference[:] #error for neurons
        for l in range(len(self.layers)-2,-1,-1):
            p_n = np.tile(n_os[l],self.layers[l+1]) #previous neuron
            neurons = np.arange(self.layers[l+1])
            error = (self.neuronErrors(l,neurons,error,n_os))
            if not self.activations[l]=='softmax':
                error = error*activationDerivative(neuron_outputs[l+1],self.activations[l])
            else:
                error = error @ activationDerivative(neuron_outputs[l+1],self.activations[l]) #because softmax derivative returns different dimensions
            w_grad = np.repeat(error,self.layers[l]) #weights gradient
            b_grad = np.ravel(error) #biases gradient
            w_grad = w_grad*p_n
            b_grad = b_grad
            gradient_w.append(w_grad)
            gradient_b.append(b_grad)
        return(gradient_w,gradient_b)

Thanks for the answers; this is my first question.

The loss here depends on the output for each problem. In general, the loss you backpropagate should be a single number that represents everything you have processed. For policy gradients, it is the reward the policy thinks it will get compared to the original reward, and the log is just a way to bring it back to a probability-like random variable in a single dimension. If you want to check the behaviour behind the code, you should always check the shapes/dimensions between each step to understand it fully.
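
For example (a standalone toy check with made-up numbers, not tied to the code in the question), the per-step quantity reduces to a single scalar once you index the sampled action:

import numpy as np

probs = np.array([0.7, 0.3])   # softmax output for one state (toy values)
action = 1                     # the action that was actually sampled
total_reward = 12.5            # discounted return from this step (toy value)

per_step_objective = total_reward * np.log(probs[action])
print(np.shape(per_step_objective))  # () -- a single scalar, not a vector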

Use this post as a reference for the gradient computation (https://medium.com/@jonathan_hui/rl-policy-gradients-explained-9b13b688b146):

In my opinion totalRewardOfEpisode*np.log(probability of sampled action) is the right computation. However, to get a good estimate of the gradient, I suggest computing it over many episodes (for example 30; you then just average your final gradient by dividing by 30).

The main difference from your totalReward*np.log(probs) test is that, at each step, I think you should only backpropagate on the probability of the action you actually sampled, not on the whole output vector. In the cited article they initially use the total reward, but at the end they suggest, as you do, using the discounted sum of the current and future rewards, so that part does not seem theoretically problematic. A rough sketch of this per-step quantity follows.
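
A minimal sketch of that per-step quantity (standalone NumPy with hypothetical helper names, not the RLPolicy interface from the question):

import numpy as np

DISCOUNT = 0.95
N_EPISODES = 30  # average the accumulated gradient over this many episodes

def discounted_returns(rewards, gamma=DISCOUNT):
    """G_t = sum_k gamma**k * rewards[t+k] for every step t of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def per_step_objectives(probs_per_step, actions, rewards):
    """One scalar per step: G_t * log pi(a_t | s_t) -- only the sampled action's probability."""
    returns = discounted_returns(rewards)
    return [G * np.log(probs[a]) for G, probs, a in zip(returns, probs_per_step, actions)]

The gradient of each of these scalars with respect to the network parameters is what you accumulate; dividing the summed gradient by N_EPISODES gives the averaged update.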

Old answer:

To my knowledge, deep RL methods use some estimate of the value of the state, or of the value of each action, in the game. From what I see in your code, you have a neural network that only outputs a probability for each action.

Although what you want is certainly to maximize the total reward, you cannot compute a gradient of the final reward with respect to the network because of the environment. I suggest looking into methods such as deep Q-learning, or Actor/Critic based methods such as PPO.

Depending on the method you choose, you will get different answers on how to compute the gradient.

mprouveur's answer was half correct, but I felt I needed to explain the right thing to backpropagate. The answer to my question on ai.stackexchange.com is how I came to understand this. The correct error to backpropagate is the log probability of the action taken multiplied by the goal reward. This can also be computed as the cross-entropy loss between the output probabilities and an array of zeros with a one at the action taken. Because of the derivative of the cross-entropy loss, this has the effect of pushing only the probability of the action taken closer to one. The multiplication by the total reward then makes better actions get pushed more strongly toward a higher probability. So, with the label being a one-hot encoded vector, the correct equation is label/probs * totalReward, because that is the derivative of the cross-entropy loss and the derivative of the log of the probabilities. I got this working in other code, but even with this equation I think something else in my code is wrong. It probably has to do with how I made the softmax derivative too complicated, instead of computing it the usual way by combining the cross-entropy derivative and the softmax derivative. I will update this answer soon with correct code and more information.
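
To make that last point concrete, here is a minimal standalone sketch (hypothetical function name, plain NumPy, not the RLPolicy code above) of the usual simplification: with a softmax output layer and a cross-entropy loss against the one-hot action, the two derivatives combine so that the error at the softmax pre-activation is just (one_hot - probs), scaled by the total discounted reward for gradient ascent:

import numpy as np

def policy_gradient_error(probs, action, total_reward):
    """Error signal at the softmax pre-activation, combining the cross-entropy
    and softmax derivatives: total_reward * (one_hot - probs).
    With gradient ascent this pushes the sampled action's probability toward one,
    scaled by how good the episode's return was."""
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return total_reward * (one_hot - probs)

# toy example: 2 actions, action 0 was taken, discounted return of 20
print(policy_gradient_error(np.array([0.6, 0.4]), action=0, total_reward=20.0))
# [ 8. -8.]  -> increase the probability of action 0, decrease action 1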