Stochastic Gradient Descent failed to converge in my neural network implementation

Question

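I have been trying to build a neural network, trained with feed-forward backpropagation using stochastic gradient descent and the sum of squared errors as the cost function, that can represent this training data: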

                 Input      Output
                {{0,1}  , {1,0,0,0,0,0,0,0}}
                {{0.1,1}, {0,1,0,0,0,0,0,0}}
                {{0.2,1}, {0,0,1,0,0,0,0,0}}
                {{0.3,1}, {0,0,0,1,0,0,0,0}}
                {{0.4,1}, {0,0,0,0,1,0,0,0}}
                {{0.5,1}, {0,0,0,0,0,1,0,0}}
                {{0.6,1}, {0,0,0,0,0,0,1,0}}
                {{0.7,1}, {0,0,0,0,0,0,0,1}}

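The network consists of 1 input unit, 1 bias unit, and 8 output units, with 16 weights in total: 8 input weights and 8 bias weights. Each output unit has its own pair of weights, 1 from the input and 1 from the bias. However, training converges very slowly. I use the sigmoid activation function for all output units: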

output = 1/(1+e^(-weightedSum))

The error gradient I derived is:

 errorGradient = (output-trainingData) * output * (1-output)*inputUnit;

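Here trainingData is the target output specified in the training set at the index of the current output unit, and inputUnit is the input unit connected to the current weight. So in each iteration I update every individual weight with the following equation: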

weights of i = weights of i - (learningRate * errorGradient)
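
For reference, the gradient and update rule can be written out with the chain rule. This is a sketch in notation that is not in the original post: t stands for trainingData, o for output, x_i for inputUnit, and \eta for learningRate, applied once in the update step as the posted code does.

```latex
E = \tfrac{1}{2}\,(t - o)^2, \qquad o = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \sum_i w_i x_i

\frac{\partial E}{\partial w_i}
  = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial z}\,\frac{\partial z}{\partial w_i}
  = (o - t)\; o\,(1 - o)\; x_i

w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i}
```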

Code:

package ann;


import java.util.Arrays;
import java.util.Random;

public class MSEANN {

static double learningRate= 0.1;
static double totalError=0;
static double previousTotalError=Double.POSITIVE_INFINITY;
static double[] weights;

public static void main(String[] args) {

    genRanWeights();

    double [][][] trainingData = {
            {{0,1}, {1,0,0,0,0,0,0,0}},
            {{0.1,1}, {0,1,0,0,0,0,0,0}},
            {{0.2,1}, {0,0,1,0,0,0,0,0}},
            {{0.3,1}, {0,0,0,1,0,0,0,0}},
            {{0.4,1}, {0,0,0,0,1,0,0,0}},
            {{0.5,1}, {0,0,0,0,0,1,0,0}},
            {{0.6,1}, {0,0,0,0,0,0,1,0}},
            {{0.7,1}, {0,0,0,0,0,0,0,1}},
    };


 while(true){

     int errorCount = 0;
     totalError=0;

     //Iterate through training set
     for(int i=0; i < trainingData.length; i++){
         //Iterate through the output units
         for (int out=0 ; out < trainingData[i][1].length ; out++) {
             double weightedSum = 0;

             //Calculate the weighted sum for this training example and this output unit
             for(int ii=0; ii < trainingData[i][0].length; ii++) {
                 weightedSum += trainingData[i][0][ii] * weights[out*(2)+ii];
             }

             //Calculate output
             double output = 1/(1+Math.exp(-weightedSum));

             double error = Math.pow(trainingData[i][1][out] - output,2)/2;

             totalError+=error;
             if(error >=0.001){
                 errorCount++;
             }



             //Update the two weights (input and bias) that feed this output unit
             for(int iii = out*2; iii < (out+1)*2; iii++) {
                 double firstGrad= -( trainingData[i][1][out] - output  ) * output*(1-output);
                 weights[iii] -= learningRate * firstGrad * trainingData[i][0][iii % 2];
             }

         }

     }


     //Total Error accumulated
     System.out.println(totalError);

     //If the total error did not decrease since the previous pass, terminate the program.
     if (totalError-previousTotalError>=0){
          System.out.println("FAIL TO CONVERGE");
          System.exit(0);
     }
     previousTotalError=totalError;

     if(errorCount == 0){
         System.out.println("Final weights: " + Arrays.toString(weights));
         System.exit(0);

     }

 }

}

//Generate random weights
static void genRanWeights() {
    Random r = new Random();
    double low  = -1/(Math.sqrt(2));
    double high = 1/(Math.sqrt(2));
    double[] result = new double[16];
    for(int i=0;i<result.length;i++)  {
        result[i] = low + (high-low)*r.nextDouble();
    }
    System.out.println(Arrays.toString(result));

     weights = result;
}

}

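In the code above, I debugged the ANN by printing the total error accumulated during each pass over the training set. The error does decrease on every iteration, but very slowly. I have tweaked the learning rate, but it made little difference. I also tried reducing the training set to the following: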

         Input      Output
        {{0  ,1}, {1,0,0,0,0,0,0,0}},
        {{0.1,1}, {0,1,0,0,0,0,0,0}},
//      {{0.2,1}, {0,0,1,0,0,0,0,0}},

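The network then trains very quickly, almost instantly, and reproduces the target results. However, if I uncomment the third row, training becomes very slow and never converges while the program runs, even though the total error keeps decreasing. So the pattern I found is this: with 3 training examples, training takes so long that I have never seen the ANN finish; with 2 or fewer, the network produces the correct output almost instantly.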

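So my question is: is this 'anomaly' I am observing caused by a wrong choice of activation function, by the choice of learning rate, or simply by a wrong implementation? And in the future, what are the steps you recommend I should take to debug effectively for this type of problem?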

Answer 1

Your implementation seems correct, and the problem is not related to the choice of learning rate.

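The problem comes from a limitation of the single-layer perceptron (no hidden layer): it cannot solve problems that are not linearly separable, such as the XOR binary operation, unless a special activation function is used to make XOR work. I don't know whether a special activation function could solve your problem. To solve it, you will probably have to choose another neural network layout, such as a multi-layer perceptron.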

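The problem you give to the single-layer perceptron is not linearly separable on a 2-dimensional surface. When the input takes only 2 different values, the outputs can be separated with one line. But with 3 or more different input values and the outputs you want, some outputs need two lines to be separated from the other values.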

For example, here is a 2D graph for the second output neuron of your network, with 3 possible input values as in your test:

    ^
    |
    |      line 1      
    |        |   line 2
    |        |     |
    |        |     |
0.0 -     0  |  1  |  0    
    |        |     |
    |
    +-----|-----|-----|-----------> input values
         0.0   0.1   0.2

To separate the 1 from the two 0s, two lines are needed instead of one. So the second neuron will not be able to produce the desired output.

Since the bias always has the same value, it does not affect the problem and does not appear on the graph.
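
The two-line argument can be checked numerically. The sketch below is not part of the original answer: it scans a grid of (weight, bias) pairs for one sigmoid unit and counts how many put inputs 0.0 and 0.2 below 0.5 while putting 0.1 above it. The class name, the grid range, and the 0.5 decision threshold are all assumptions chosen for illustration.

```java
class SeparabilityCheck {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // True if a unit with weight w and bias b outputs "0, 1, 0"
    // (relative to a 0.5 threshold) for the inputs 0.0, 0.1, 0.2.
    static boolean matchesPattern(double w, double b) {
        return sigmoid(w * 0.0 + b) < 0.5
            && sigmoid(w * 0.1 + b) > 0.5
            && sigmoid(w * 0.2 + b) < 0.5;
    }

    // Scans w and b over [-100, 100] in steps of 0.1 (an arbitrary grid)
    // and counts the pairs that reproduce the pattern.
    static int countSolutions() {
        int found = 0;
        for (int wi = -1000; wi <= 1000; wi++) {
            for (int bi = -1000; bi <= 1000; bi++) {
                if (matchesPattern(wi * 0.1, bi * 0.1)) {
                    found++;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("Matching (w, b) pairs: " + countSolutions());
    }
}
```

No pair on the grid matches, because the weighted sum at 0.1 is the average of the weighted sums at 0.0 and 0.2, so it cannot be positive when both of those are negative.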

If you change the target outputs so that the problem is linearly separable, the single-layer perceptron will work:

{{0.0, 1}, {1,0,0,0,0,0,0,0}},
{{0.1, 1}, {1,1,0,0,0,0,0,0}},
{{0.2, 1}, {1,1,1,0,0,0,0,0}},
{{0.3, 1}, {1,1,1,1,0,0,0,0}},
{{0.4, 1}, {1,1,1,1,1,0,0,0}},
{{0.5, 1}, {1,1,1,1,1,1,0,0}},
{{0.6, 1}, {1,1,1,1,1,1,1,0}},
{{0.7, 1}, {1,1,1,1,1,1,1,1}},
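
To illustrate why these targets are within reach of a single layer, the sketch below uses hand-picked weights (not trained ones, and not taken from the original answer) that place each unit's threshold midway between two neighbouring inputs. The weight 200 and the class and method names are assumptions for illustration only.

```java
class ThermometerWeights {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Output of unit k for raw input x, with an assumed input weight of 200
    // and a bias that puts the 0.5 crossing at x = 0.1 * k - 0.05.
    static double unitOutput(int k, double x) {
        double w = 200.0;
        double b = -w * (0.1 * k - 0.05);
        return sigmoid(w * x + b);
    }

    // Checks every (example, unit) pair against the modified targets,
    // using the same per-unit error bound (0.001) as the question's code.
    static boolean reproducesTargets() {
        for (int i = 0; i < 8; i++) {
            double x = i / 10.0;
            for (int k = 0; k < 8; k++) {
                double target = (k <= i) ? 1.0 : 0.0;
                double error = Math.pow(target - unitOutput(k, x), 2) / 2;
                if (error >= 0.001) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Targets reproduced: " + reproducesTargets());
    }
}
```

One weight and one bias per unit are enough here because each target column switches from 0 to 1 only once as the input grows.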

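In some cases, it is possible to introduce arbitrary inputs computed from the true input. For example, with 4 possible values for the true input: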

{{-1.0, 0.0, 1}, {1,0,0,0,0,0,0,0}},
{{-1.0, 0.1, 1}, {0,1,0,0,0,0,0,0}},
{{ 1.0, 0.2, 1}, {0,0,1,0,0,0,0,0}},
{{ 1.0, 0.3, 1}, {0,0,0,1,0,0,0,0}},

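If, for each output neuron, you plot the true input on the X axis and the arbitrary input on the Y axis, you will see that, among the 4 points representing the outputs, the 1 can be separated from the 0s with a single line.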

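To handle 8 possible values of the true input, you could add a second arbitrary input and obtain a 3D graph. Another way to handle 8 possible values without a second arbitrary input is to place the points on a circle. For example: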

double [][][] trainingData = {
  {{0.0, 0.0, 1}, {1,0,0,0,0,0,0,0}},
  {{0.0, 0.1, 1}, {0,1,0,0,0,0,0,0}},
  {{0.0, 0.2, 1}, {0,0,1,0,0,0,0,0}},
  {{0.0, 0.3, 1}, {0,0,0,1,0,0,0,0}},
  {{0.0, 0.4, 1}, {0,0,0,0,1,0,0,0}},
  {{0.0, 0.5, 1}, {0,0,0,0,0,1,0,0}},
  {{0.0, 0.6, 1}, {0,0,0,0,0,0,1,0}},
  {{0.0, 0.7, 1}, {0,0,0,0,0,0,0,1}},
};

for(int i=0; i<8;i++) {
  // multiply the true inputs by 8 before the sin/cos in order
  // to increase the distance between points, and multiply the
  // resulting sin/cos by 2 for the same reason
  trainingData[i][0][0] = 2.0*Math.cos(trainingData[i][0][1]*8.0);
  trainingData[i][0][1] = 2.0*Math.sin(trainingData[i][0][1]*8.0);
}
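
A quick way to see why the circle layout helps: for each of the 8 points, a line perpendicular to that point's own direction can cut it off from the other 7. The sketch below, which is an illustration and not part of the original answer, measures this with dot products. The class and method names are hypothetical; the mapping is the same one used in the loop above.

```java
class CircleInputs {
    // Maps a raw input x to a point on a circle of radius 2,
    // using the same formula as the loop above.
    static double[] toCircle(double x) {
        return new double[] { 2.0 * Math.cos(x * 8.0), 2.0 * Math.sin(x * 8.0) };
    }

    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1];
    }

    // For every point p_k, compares p_k . p_k with p_k . p_j for all j != k
    // and returns the smallest gap found. A positive gap means that using
    // p_k as the weight vector, with a bias between the two values,
    // separates p_k from every other point with one line.
    static double minMargin() {
        double[][] pts = new double[8][];
        for (int i = 0; i < 8; i++) {
            pts[i] = toCircle(i / 10.0);
        }
        double min = Double.POSITIVE_INFINITY;
        for (int k = 0; k < 8; k++) {
            double self = dot(pts[k], pts[k]);
            for (int j = 0; j < 8; j++) {
                if (j != k) {
                    min = Math.min(min, self - dot(pts[k], pts[j]));
                }
            }
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println("Smallest separation margin: " + minMargin());
    }
}
```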

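If you don't want to or can't add arbitrary inputs or modify the target outputs, you will have to choose another neural network layout, such as a multi-layer perceptron. A special activation function might still solve your problem with a single-layer perceptron, though. I tried a Gaussian, but it did not work, possibly because of wrong parameters.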

And in the future, what are the steps you recommend I should take to debug effectively for this type of problem?

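Think about the limitations of the layout you chose, and try other layouts. If you choose a multi-layer perceptron, consider changing the number of hidden layers and the number of neurons in those layers.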

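It is sometimes possible to normalize the inputs and outputs of the network. In some cases this greatly improves performance, as in the tests I ran with your training data. But I think that in some cases it is better to train the network on the true inputs, regardless of the time needed to train it.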

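I tested your training data with a multi-layer perceptron that has one hidden layer of 15 neurons and no sigmoid function on the output neurons. With a learning rate of 0.1, my network converged and stopped at the desired error after about 100 000 training epochs.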

If I modify the inputs this way:

0   -> 0
0.1 -> 1
0.2 -> 2
0.3 -> 3
0.4 -> 4
0.5 -> 5
0.6 -> 6
0.7 -> 7

then my network converges faster. It is faster still if I shift the values to the range [-7,7]:

0   -> -7
0.1 -> -5
0.2 -> -3
0.3 -> -1
0.4 ->  1
0.5 ->  3
0.6 ->  5
0.7 ->  7

It is a bit faster again if I modify the target outputs, replacing the 0s with -1:

{{-7,1}, { 1,-1,-1,-1,-1,-1,-1,-1}},
{{-5,1}, {-1, 1,-1,-1,-1,-1,-1,-1}},
{{-3,1}, {-1,-1, 1,-1,-1,-1,-1,-1}},
{{-1,1}, {-1,-1,-1, 1,-1,-1,-1,-1}},
{{ 1,1}, {-1,-1,-1,-1, 1,-1,-1,-1}},
{{ 3,1}, {-1,-1,-1,-1,-1, 1,-1,-1}},
{{ 5,1}, {-1,-1,-1,-1,-1,-1, 1,-1}},
{{ 7,1}, {-1,-1,-1,-1,-1,-1,-1, 1}},

With this normalization of inputs and outputs, I reach the desired error after about 2000 training epochs, against 100 000 without normalization.
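
The two mappings above can be expressed as simple affine rescalings. The sketch below is an assumption that happens to fit the tables (x' = 20x - 7 for inputs, t' = 2t - 1 for targets); the answer itself only lists the values, and the class and method names are hypothetical.

```java
class Normalization {
    // Maps 0, 0.1, ..., 0.7 onto -7, -5, ..., 7.
    static double scaleInput(double x) {
        return 20.0 * x - 7.0;
    }

    // Maps target 0 to -1 and target 1 to 1.
    static double scaleTarget(double t) {
        return 2.0 * t - 1.0;
    }

    // Rescales one training example laid out as {{input, bias}, {targets...}}.
    // The constant bias input is left unchanged.
    static double[][] normalize(double[][] example) {
        double[] inputs = { scaleInput(example[0][0]), example[0][1] };
        double[] targets = new double[example[1].length];
        for (int i = 0; i < targets.length; i++) {
            targets[i] = scaleTarget(example[1][i]);
        }
        return new double[][] { inputs, targets };
    }

    public static void main(String[] args) {
        double[][] row = { {0.3, 1}, {0, 0, 0, 1, 0, 0, 0, 0} };
        double[][] scaled = normalize(row);
        System.out.println(java.util.Arrays.toString(scaled[0]));
        System.out.println(java.util.Arrays.toString(scaled[1]));
    }
}
```

Targets of -1 only make sense when the output neurons can produce negative values, as with the output layer without a sigmoid described above.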

Another example is your implementation with the first two rows of your training data, as in your question:

         Input      Output
        {{0  ,1}, {1,0,0,0,0,0,0,0}},
        {{0.1,1}, {0,1,0,0,0,0,0,0}},
//      {{0.2,1}, {0,0,1,0,0,0,0,0}},

It takes about 600 000 training epochs to reach the desired error. But if I use this training data:

 Input      Output
{{0  ,1}, {1,0,0,0,0,0,0,0}},
{{1  ,1}, {0,1,0,0,0,0,0,0}},

with 1 instead of 0.1 as the input, it takes only 9000 training epochs. And if I use 10 instead of 0.1 and -10 instead of 0, it takes only 1500 training epochs.

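But, unlike with my multi-layer perceptron, replacing the 0s with -1 in the target outputs destroys the performance, which is to be expected, since a sigmoid output stays between 0 and 1 and can never reach -1.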
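
The effect of input spacing on the two-row case can be reproduced with a small harness. This is a sketch modelled on the question's training loop, not the code used for the measurements above: the class and method names, the fixed random seed, and the epoch cap are assumptions, and the epoch counts it prints will differ from the figures quoted here.

```java
import java.util.Random;

class InputScalingDemo {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Builds the two-row training set with the given raw input values.
    static double[][][] twoRows(double x0, double x1) {
        return new double[][][] {
            {{x0, 1}, {1, 0, 0, 0, 0, 0, 0, 0}},
            {{x1, 1}, {0, 1, 0, 0, 0, 0, 0, 0}},
        };
    }

    // Per-example SGD on the single-layer sigmoid network from the question.
    // Returns the first epoch in which every per-unit error is below 0.001,
    // or -1 if that does not happen within maxEpochs.
    static int epochsToConverge(double[][][] data, double learningRate,
                                int maxEpochs, long seed) {
        Random r = new Random(seed);
        int outputs = data[0][1].length;
        double[] weights = new double[outputs * 2];
        double bound = 1 / Math.sqrt(2);
        for (int i = 0; i < weights.length; i++) {
            weights[i] = -bound + 2 * bound * r.nextDouble();
        }
        for (int epoch = 1; epoch <= maxEpochs; epoch++) {
            int errorCount = 0;
            for (double[][] example : data) {
                double[] in = example[0];
                double[] target = example[1];
                for (int out = 0; out < outputs; out++) {
                    double z = in[0] * weights[2 * out] + in[1] * weights[2 * out + 1];
                    double o = sigmoid(z);
                    if (Math.pow(target[out] - o, 2) / 2 >= 0.001) {
                        errorCount++;
                    }
                    // Gradient of the squared error with respect to the weighted sum.
                    double delta = (o - target[out]) * o * (1 - o);
                    weights[2 * out] -= learningRate * delta * in[0];
                    weights[2 * out + 1] -= learningRate * delta * in[1];
                }
            }
            if (errorCount == 0) {
                return epoch;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int cap = 2_000_000;
        System.out.println("inputs -10, 10: " + epochsToConverge(twoRows(-10, 10), 0.1, cap, 42L));
        System.out.println("inputs 0, 1:    " + epochsToConverge(twoRows(0, 1), 0.1, cap, 42L));
        System.out.println("inputs 0, 0.1:  " + epochsToConverge(twoRows(0, 0.1), 0.1, cap, 42L));
    }
}
```

Widely spaced inputs help because the weight update is proportional to the input value, and because a smaller weight is then enough to push the two weighted sums far apart.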
