Why can't a deep NN approximate a simple ln(x) function?
I created an ANN with two ReLU hidden layers plus a linear output layer and tried to approximate the simple ln(x) function. And I can't do this well. I am confused, because ln(x) on the range x: [0.0, 1.0] should be approximated without problems (I am using a learning rate of 0.01 and basic gradient descent optimization).
import tensorflow as tf
import numpy as np

def GetTargetResult(x):
    curY = np.log(x)
    return curY

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Parameters
learning_rate = 0.01
training_epochs = 10000
batch_size = 50
display_step = 500

# Network Parameters
n_hidden_1 = 50  # 1st layer number of features
n_hidden_2 = 10  # 2nd layer number of features
n_input = 1

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_uniform([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_uniform([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_uniform([n_hidden_2, 1]))
}
biases = {
    'b1': tf.Variable(tf.random_uniform([n_hidden_1])),
    'b2': tf.Variable(tf.random_uniform([n_hidden_2])),
    'out': tf.Variable(tf.random_uniform([1]))
}

x_data = tf.placeholder(tf.float32, [None, 1])
y_data = tf.placeholder(tf.float32, [None, 1])

# Construct model
pred = multilayer_perceptron(x_data, weights, biases)

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(pred - y_data))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train = optimizer.minimize(loss)

# Before starting, initialize the variables. We will 'run' this first.
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

for step in range(training_epochs):
    x_in = np.random.rand(batch_size, 1).astype(np.float32)
    y_in = GetTargetResult(x_in)
    sess.run(train, feed_dict={x_data: x_in, y_data: y_in})
    if step % display_step == 0:
        curX = np.random.rand(1, 1).astype(np.float32)
        curY = GetTargetResult(curX)
        curPrediction = sess.run(pred, feed_dict={x_data: curX})
        curLoss = sess.run(loss, feed_dict={x_data: curX, y_data: curY})
        print("For x = {0} and target y = {1} prediction was y = {2} and squared loss was = {3}".format(curX, curY, curPrediction, curLoss))
With the configuration above, the NN just learns to guess y = -1.00. I have tried different learning rates, a couple of optimizers, and different configurations, with no success: learning does not converge in any case. I have done similar things with logarithms in other deep learning frameworks in the past without problems. Could this be a TF-specific issue? What am I doing wrong?
I have not looked at the code, but here is the theory. If you use an activation function like tanh, then for small weights the activation function operates in its linear region, and for large weights the activation function saturates at -1 or +1. If you are in the linear region across all layers, then you cannot approximate complex functions (i.e. you have a sandwich of linear layers, hence the best you can do is a linear approximation), but if you have larger weights, the nonlinearity allows you to approximate a wide range of functions. There are no free lunches; the weights need to be at the right values to avoid over- and underfitting. This process is called regularization.
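For intuition, here is a quick numpy check of those two regimes (my own illustrative example, not from the answer above):

import numpy as np

x = np.array([0.05, 0.1, 2.0, 5.0])
print(np.tanh(x))  # -> [0.04996 0.09967 0.96403 0.99991]
# Small pre-activations: tanh(x) is approximately x (linear region).
# Large pre-activations: tanh(x) saturates near +/-1.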
First of all, your input data is in the range [0, 1), which is not a good input for a neural network. Subtract the mean from x after computing y, to normalize it (ideally also divide by the standard deviation).
However, in your particular case that was not enough to make it work.

I played with it and found two ways to make it work (both require the data normalization described above):

- either remove the second layer completely, or
- make the number of neurons in the second layer 50.
My guess is that 10 neurons do not have sufficient representational capacity to pass enough information through to the last layer (obviously, a perfectly smart NN would learn to ignore the second layer in this case, passing the answer through in one of the neurons, but the theoretical possibility does not mean that gradient descent will learn to do so).
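For concreteness, here is a minimal sketch of those two changes on top of the question's training loop; the constants are mine (for x drawn uniformly from [0, 1), the mean is 0.5 and the standard deviation is 1/sqrt(12), roughly 0.2887):

# Hypothetical sketch of the two fixes described above.
n_hidden_2 = 50  # widen the second layer from 10 to 50 (or remove it entirely)

x_in = np.random.rand(batch_size, 1).astype(np.float32)
y_in = GetTargetResult(x_in)   # compute y from the raw x first
x_in = (x_in - 0.5) / 0.2887   # then normalize x: zero mean, roughly unit std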
What your network has to predict

[Plot of ln(x) for x in (0, 1]. Source: WolframAlpha]

What your architecture is

ReLU(ReLU(x * W_1 + b_1) * W_2 + b_2) * W_out + b_out

Thoughts
My first thought was that the ReLU is the problem (ln(x) is negative everywhere on (0, 1), so an output ReLU could never produce the targets). However, you do not apply a ReLU to the output layer, so that should not cause the problem.
Changing the initialization (from uniform to truncated normal) and the optimizer (from SGD to Adam) seems to fix the problem:
#!/usr/bin/env python
import tensorflow as tf
import numpy as np

def get_target_result(x):
    return np.log(x)

def multilayer_perceptron(x, weights, biases):
    """Create model."""
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Parameters
learning_rate = 0.01
training_epochs = 10**6
batch_size = 500
display_step = 500

# Network Parameters
n_hidden_1 = 50  # 1st layer number of features
n_hidden_2 = 10  # 2nd layer number of features
n_input = 1

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1)),
    'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1)),
    'out': tf.Variable(tf.truncated_normal([n_hidden_2, 1], stddev=0.1))
}
biases = {
    'b1': tf.Variable(tf.constant(0.1, shape=[n_hidden_1])),
    'b2': tf.Variable(tf.constant(0.1, shape=[n_hidden_2])),
    'out': tf.Variable(tf.constant(0.1, shape=[1]))
}

x_data = tf.placeholder(tf.float32, [None, 1])
y_data = tf.placeholder(tf.float32, [None, 1])

# Construct model
pred = multilayer_perceptron(x_data, weights, biases)

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(pred - y_data))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# train = optimizer.minimize(loss)
train = tf.train.AdamOptimizer(1e-4).minimize(loss)

# Before starting, initialize the variables. We will 'run' this first.
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

for step in range(training_epochs):
    x_in = np.random.rand(batch_size, 1).astype(np.float32)
    y_in = get_target_result(x_in)
    sess.run(train, feed_dict={x_data: x_in, y_data: y_in})
    if step % display_step == 0:
        curX = np.random.rand(1, 1).astype(np.float32)
        curY = get_target_result(curX)
        curPrediction = sess.run(pred, feed_dict={x_data: curX})
        curLoss = sess.run(loss, feed_dict={x_data: curX, y_data: curY})
        print(("For x = {0} and target y = {1} prediction was y = {2} and "
               "squared loss was = {3}").format(curX, curY,
                                                curPrediction, curLoss))
Training this for 1 minute gave me:
For x = [[ 0.19118255]] and target y = [[-1.65452647]] prediction was y = [[-1.65021849]] and squared loss was = 1.85587377928e-05
For x = [[ 0.17362741]] and target y = [[-1.75084364]] prediction was y = [[-1.74087048]] and squared loss was = 9.94640868157e-05
For x = [[ 0.60853624]] and target y = [[-0.4966988]] prediction was y = [[-0.49964082]] and squared loss was = 8.65551464813e-06
For x = [[ 0.33864763]] and target y = [[-1.08279514]] prediction was y = [[-1.08586168]] and squared loss was = 9.4036658993e-06
For x = [[ 0.79126364]] and target y = [[-0.23412406]] prediction was y = [[-0.24541236]] and squared loss was = 0.000127425722894
For x = [[ 0.09994856]] and target y = [[-2.30309963]] prediction was y = [[-2.29796076]] and squared loss was = 2.6408026315e-05
For x = [[ 0.31053194]] and target y = [[-1.16946852]] prediction was y = [[-1.17038012]] and squared loss was = 8.31002580526e-07
For x = [[ 0.0512077]] and target y = [[-2.97186542]] prediction was y = [[-2.96796203]] and squared loss was = 1.52364455062e-05
For x = [[ 0.120253]] and target y = [[-2.11815739]] prediction was y = [[-2.12729549]] and squared loss was = 8.35050013848e-05
So the answer might be that your optimizer is not good / the optimization problem starts from a bad point. See:

- Xavier Glorot, Yoshua Bengio: Understanding the difficulty of training deep feedforward neural networks
- Visualizing Optimization Algos
The image below is from Alec Radford's nice gifs. It does not contain ADAM, but it gives you a feeling for how much better other optimizers can do than SGD:

[Animation of optimization algorithms descending a loss surface]
Two ideas for how this might be improved (a rough sketch of both follows below):

- Try dropout.
- Try not to use x values close to 0. I would rather sample values in [0.01, 1].
However, my experience with regression problems is quite limited.
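For what it is worth, here is a hypothetical sketch of both ideas against the TF 1.x graph code above; keep_prob, the 0.9 value, and the sampling bounds are my own illustrative choices, not something benchmarked in this answer:

# Idea 1: dropout after the first hidden layer.
keep_prob = tf.placeholder(tf.float32)  # feed e.g. 0.9 while training, 1.0 at evaluation time

def multilayer_perceptron_with_dropout(x, weights, biases, keep_prob):
    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, weights['h1']), biases['b1']))
    layer_1 = tf.nn.dropout(layer_1, keep_prob)  # zero random units, scale survivors by 1/keep_prob
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['h2']), biases['b2']))
    return tf.matmul(layer_2, weights['out']) + biases['out']

# Idea 2: sample x away from 0, where ln(x) diverges.
x_in = np.random.uniform(0.01, 1.0, size=(batch_size, 1)).astype(np.float32)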