Loss function and deep learning

From deeplearning.ai:

The general methodology to build a Neural Network is to:

  1. Define the neural network structure ( # of input units, # of hidden units, etc).
  2. Initialize the model's parameters
  3. Loop:
    • Implement forward propagation
    • Compute loss
    • Implement backward propagation to get the gradients
    • Update parameters (gradient descent)

How does the loss function affect the way the network learns?

For example, here is a forward- and backward-propagation implementation that I believe is correct, because I can train a model with the following code and get acceptable results:

for i in range(number_iterations):

    # forward propagation
    Z1 = np.dot(weight_layer_1, xtrain) + bias_1
    a_1 = sigmoid(Z1)

    Z2 = np.dot(weight_layer_2, a_1) + bias_2
    a_2 = sigmoid(Z2)

    mse_cost = np.sum(cost_all_examples)
    cost_cross_entropy = -(1.0/len(X_train) * (np.dot(np.log(a_2), Y_train.T) + np.dot(np.log(1-a_2), (1-Y_train).T)))

    # back propagation and gradient descent
    d_Z2 = np.multiply((a_2 - xtrain), d_sigmoid(a_2))
    d_weight_2 = np.dot(d_Z2, a_1.T)
    d_bias_2 = np.asarray(list(map(lambda x : [sum(x)] , d_Z2)))
    # perform a parameter update in the negative gradient direction to decrease the loss
    weight_layer_2 = weight_layer_2 + np.multiply(- learning_rate , d_weight_2)
    bias_2 = bias_2 + np.multiply(- learning_rate , d_bias_2)

    d_a_1 = np.dot(weight_layer_2.T, d_Z2)
    d_Z1 = np.multiply(d_a_1, d_sigmoid(a_1))
    d_weight_1 = np.dot(d_Z1, xtrain.T)
    d_bias_1 = np.asarray(list(map(lambda x : [sum(x)] , d_Z1)))
    weight_layer_1 = weight_layer_1 + np.multiply(- learning_rate , d_weight_1)
    bias_1 = bias_1 + np.multiply(- learning_rate , d_bias_1)
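
The snippet above assumes that quantities such as cost_all_examples and the helpers sigmoid and d_sigmoid are defined elsewhere. Purely as an assumption about those missing helpers, a minimal sketch of what they might look like (the derivative is written in terms of the activation, since that is how the loop calls it):

import numpy as np

def sigmoid(z):
    # element-wise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(a):
    # derivative of the sigmoid, written in terms of the activation a = sigmoid(z)
    return a * (1.0 - a)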

Note the following lines:

mse_cost = np.sum(cost_all_examples)
cost_cross_entropy = -(1.0/len(X_train) * (np.dot(np.log(a_2), Y_train.T) + np.dot(np.log(1-a_2), (1-Y_train).T)))

I can use either the MSE cost or the cross-entropy cost to see how well the system is learning. But this is for informational purposes only; the choice of cost function does not impact how the network learns. I suspect I am misunderstanding something here, since the deep learning literature often says that the choice of loss function is an important step in deep learning. Yet, as my code above shows, I can pick either cross-entropy or MSE and it does not affect how the network learns, so is the cross-entropy or MSE cost for informational purposes only?

Update:

For example, here is the code snippet from deeplearning.ai that computes the cost:

# GRADED FUNCTION: compute_cost

def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)

    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost given equation (13)
    """

    m = Y.shape[1] # number of examples

    # Retrieve W1 and W2 from parameters
    ### START CODE HERE ### (≈ 2 lines of code)
    W1 = parameters['W1']
    W2 = parameters['W2']
    ### END CODE HERE ###

    # Compute the cross-entropy cost
    ### START CODE HERE ### (≈ 2 lines of code)
    logprobs = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
    cost = - np.sum(logprobs) / m
    ### END CODE HERE ###

    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. 
                                # E.g., turns [[17]] into 17 
    assert(isinstance(cost, float))

    return cost
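
Purely as an illustration (toy data of my own, not from the assignment), calling the function above with sigmoid outputs A2 and labels Y of shape (1, m) returns a single scalar; the parameters dictionary does not influence that value:

import numpy as np

A2 = np.array([[0.9, 0.2, 0.7]])   # hypothetical sigmoid outputs, shape (1, 3)
Y  = np.array([[1.0, 0.0, 1.0]])   # hypothetical true labels, shape (1, 3)
parameters = {'W1': None, 'b1': None, 'W2': None, 'b2': None}   # not used for the cost value

print(compute_cost(A2, Y, parameters))   # a single scalar, roughly 0.23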

This code runs as expected and achieves high accuracy / low cost. The value of the cost is not used in this implementation other than to offer information to the machine learning engineer as to how well the network is learning. This leaves me questioning how the choice of cost function affects how a neural network learns.

Well, what follows is only a rough, high-level attempt to answer what is arguably an off-topic question for SO (I understand your confusion in principle).

The value of the cost is not used in this implementation other than to offer information to the machine learning engineer as to how well the network is learning.

This is actually correct; if you read Andrew Ng's Jupyter notebooks carefully around the compute_cost function, you will see:

5 - Cost function

Now you will implement forward and backward propagation. You need to compute the cost, because you want to check if your model is actually learning.

Literally, this is the only reason for explicitly computing the actual value of the cost function in your code.

But this is for informational purposes only; the choice of cost function does not impact how the network learns.

Not so fast! Here is the (usually invisible) catch:

The choice of the cost function determines the exact equations used for computing dw and db, and hence the learning process.

Notice that I am talking here about the function itself, not its value.
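
To make this concrete, here is a small self-contained sketch of my own (not taken from the question's code) contrasting the output-layer delta d_Z2 for the two costs when the output activation is a sigmoid; the toy arrays and the 1/m scaling are my assumptions:

import numpy as np

def d_sigmoid(a):
    # derivative of the sigmoid, written in terms of the activation a = sigmoid(z)
    return a * (1.0 - a)

a_2 = np.array([[0.9, 0.2, 0.7]])   # toy sigmoid outputs, shape (1, m)
Y   = np.array([[1.0, 0.0, 1.0]])   # toy labels, shape (1, m)
m = Y.shape[1]

# MSE cost  J = 1/(2m) * sum((a_2 - Y)**2):
# the delta keeps the sigmoid derivative as a factor
d_Z2_mse = (a_2 - Y) * d_sigmoid(a_2) / m

# Cross-entropy cost  J = -1/m * sum(Y*log(a_2) + (1-Y)*log(1-a_2)):
# the sigmoid derivative cancels and the delta is simply (a_2 - Y) / m
d_Z2_xent = (a_2 - Y) / m

print(d_Z2_mse)    # approx. [[-0.003   0.0107  -0.021 ]]
print(d_Z2_xent)   # approx. [[-0.0333  0.0667  -0.1   ]]

Every quantity computed downstream (d_weight_2, d_bias_2, d_Z1, d_weight_1, ...) inherits this difference, so the two costs really do produce different gradients and therefore different parameter updates.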

In other words, computations like your

d_weight_2 = np.dot(d_Z2, a_1.T)

d_weight_1 = np.dot(d_Z1, xtrain.T)

have not fallen from the sky; they are the result of the backpropagation math applied to the specific cost function.
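
As a worked equation (my own sketch of the standard chain-rule step, written in the question's notation with Z2 = W2 a1 + b2):

$$
\frac{\partial L}{\partial W_2}
= \frac{\partial L}{\partial Z_2}\,\frac{\partial Z_2}{\partial W_2}
= \underbrace{\frac{\partial L}{\partial a_2}\,\sigma'(Z_2)}_{\text{depends on the choice of } L}\, a_1^{\top}
$$

d_weight_2 = np.dot(d_Z2, a_1.T) is exactly this expression, with d_Z2 standing in for dL/dZ2; change the cost L and d_Z2 changes, and with it every update that follows.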

There are some relevant high-level slides making this same point in Andrew's introductory Coursera course.

Hope this helps; the details of how exactly we arrive at the specific computational forms of dw and db, starting from the derivative of the cost function, are beyond the scope of this post, but you can find several good online tutorials on backpropagation (here is one).

Finally, for a (very) high-level description of what happens when we choose the wrong cost function (binary cross-entropy for multi-class classification, instead of the correct categorical cross-entropy), you can have a look at my answer here.