theano hard_sigmoid() breaks gradient descent
To highlight the problem, let's focus on this tutorial.
Theano has three ways to compute the sigmoid of a tensor: sigmoid, ultra_fast_sigmoid and hard_sigmoid. Using the latter two seems to break the gradient descent algorithm.
The conventional sigmoid converges, but the others show strange, inconsistent behaviour: ultra_fast_sigmoid throws the error "method not defined ('grad', ultra_fast_sigmoid)" outright when you try to compute the gradient, while hard_sigmoid compiles fine but fails to converge to a solution.
Does anyone know where this behaviour comes from? It is not highlighted in the documentation, and it seems counter-intuitive.
Code:
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np

x = T.dvector()
y = T.dscalar()

def layer(x, w):
    b = np.array([1], dtype=theano.config.floatX)
    new_x = T.concatenate([x, b])
    m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
    h = nnet.sigmoid(m) ## THIS SIGMOID RIGHT HERE
    return h

def grad_desc(cost, theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost, wrt=theta))

theta1 = theano.shared(np.array(np.random.rand(3,3), dtype=theano.config.floatX))
theta2 = theano.shared(np.array(np.random.rand(4,1), dtype=theano.config.floatX))

hid1 = layer(x, theta1) #hidden layer
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression

cost = theano.function(inputs=[x, y], outputs=fc, updates=[
        (theta1, grad_desc(fc, theta1)),
        (theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)

inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 1, 0, 0]) #training data Y
cur_cost = 0
for i in range(2000):
    for k in range(len(inputs)):
        cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
    if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
        print('Cost: %s' % (cur_cost,))

print(run_forward([0,1]))
print(run_forward([1,1]))
print(run_forward([1,0]))
print(run_forward([0,0]))
I changed the following lines of the code to shorten the output of this post (they differ from the tutorial, but are already included in the code above):
from theano.tensor.nnet import binary_crossentropy as cross_entropy #imports
fc = cross_entropy(out1, y) #cost expression
for i in range(4000): #training iteration
sigmoid
Cost: 1.62724279493
Cost: 0.545966632545
Cost: 0.156764560912
Cost: 0.0534911098234
Cost: 0.0280394147992
Cost: 0.0184933786794
Cost: 0.0136444190935
Cost: 0.0107482836159
0.993652087577
0.00848194143055
0.990829396285
0.00878482655791
ultra_fast_sigmoid
File "test.py", line 30, in <module>
(theta1, grad_desc(fc, theta1)),
File "test.py", line 19, in grad_desc
return theta - (alpha * T.grad(cost, wrt=theta))
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 545, in grad
grad_dict, wrt, cost_name)
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1283, in _populate_grad_dict
rval = [access_grad_cache(elem) for elem in wrt]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache
term = access_term_cache(node)[idx]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache
output_grads = [access_grad_cache(var) for var in node.outputs]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache
term = access_term_cache(node)[idx]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache
output_grads = [access_grad_cache(var) for var in node.outputs]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache
term = access_term_cache(node)[idx]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache
output_grads = [access_grad_cache(var) for var in node.outputs]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache
term = access_term_cache(node)[idx]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache
output_grads = [access_grad_cache(var) for var in node.outputs]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache
term = access_term_cache(node)[idx]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache
output_grads = [access_grad_cache(var) for var in node.outputs]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache
term = access_term_cache(node)[idx]
File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1089, in access_term_cache
input_grads = node.op.grad(inputs, new_output_grads)
File "/usr/local/lib/python2.7/dist-packages/theano/tensor/elemwise.py", line 662, in grad
rval = self._bgrad(inputs, ograds)
File "/usr/local/lib/python2.7/dist-packages/theano/tensor/elemwise.py", line 737, in _bgrad
scalar_igrads = self.scalar_op.grad(scalar_inputs, scalar_ograds)
File "/usr/local/lib/python2.7/dist-packages/theano/scalar/basic.py", line 878, in grad
self.__class__.__name__)
theano.gof.utils.MethodNotDefined: ('grad', <class 'theano.tensor.nnet.sigm.UltraFastScalarSigmoid'>, 'UltraFastScalarSigmoid')
hard_sigmoid
Cost: 1.19810193303
Cost: 0.684360309062
Cost: 0.692614056124
Cost: 0.697902474354
Cost: 0.701540531661
Cost: 0.703807604483
Cost: 0.70470238116
Cost: 0.704385738831
0.4901260624
0.486248177053
0.489490785078
0.493368670425
Here is the source code of hard_sigmoid:
def hard_sigmoid(x):
    """An approximation of sigmoid.
    More approximate and faster than ultra_fast_sigmoid.
    Approx in 3 parts: 0, scaled linear, 1
    Removing the slope and shift does not make it faster.
    """
    # Use the same dtype as determined by "upgrade_to_float",
    # and perform computation in that dtype.
    out_dtype = scalar.upgrade_to_float(scalar.Scalar(dtype=x.dtype))[0].dtype
    slope = tensor.constant(0.2, dtype=out_dtype)
    shift = tensor.constant(0.5, dtype=out_dtype)
    x = (x * slope) + shift
    x = tensor.clip(x, 0, 1)
    return x
So it is simply implemented as a piecewise linear function, whose gradient is 0.2 in the range (-2.5, 2.5) and 0 everywhere else. That means that if the input falls outside (-2.5, 2.5), its gradient is zero and no learning will happen.
So it may not be suitable for training, but it can be used to approximate the prediction results.
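Since only the gradient is the problem, one possible workaround (a minimal sketch of my own, not from the tutorial; the act argument is a made-up parameter for illustration) is to build the training graph with the exact nnet.sigmoid and compile a separate, cheaper prediction function that reuses the same weights with nnet.hard_sigmoid:

import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np

x = T.dvector()
w = theano.shared(np.array(np.random.rand(3, 3), dtype=theano.config.floatX))

def layer(x, w, act):
    # same layer as in the question, but with the activation passed in
    b = np.array([1], dtype=theano.config.floatX)
    return act(T.dot(w.T, T.concatenate([x, b])))

train_out = T.sum(layer(x, w, nnet.sigmoid))       # exact sigmoid: gradient is never exactly zero
pred_out = T.sum(layer(x, w, nnet.hard_sigmoid))   # fast approximation, same weights

grad_w = T.grad(train_out, w)                      # use this for the weight updates
predict = theano.function([x], pred_out)           # cheap forward pass for inference
print(predict([0, 1]))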
EDIT:
To evaluate the gradient of the network parameters, normally we use backpropagation.
Here is a very simple example.
x = theano.tensor.scalar()
w = theano.shared(numpy.float32(1))
y = theano.tensor.nnet.hard_sigmoid(w*x) # y=w*x, w is initialized to 1.
dw = theano.grad(y, w) # gradient wrt w, which is equal to slope*x in this case
net = theano.function([x], [y, dw])
print net(-3)
print net(-1)
print net(0)
print net(1)
print net(3)
Output:
[array(0.0), array(-0.0)] # zero gradient because the slope is zero
[array(0.3), array(-0.2)]
[array(0.5), array(0.0)] # zero gradient because x is zero
[array(0.7), array(0.2)]
[array(1.0), array(0.0)] # zero gradient because the slope is zero
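To make the contrast with the exact sigmoid explicit, here is a small comparison of my own (same toy setup as above, not part of the original answer): at a saturating input the exact sigmoid still returns a small nonzero gradient, while hard_sigmoid returns exactly zero, so the updates stop.

import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy

x = T.scalar()
w = theano.shared(numpy.float32(1))

g_exact = theano.grad(nnet.sigmoid(w * x), w)      # = sigmoid'(w*x) * x
g_hard = theano.grad(nnet.hard_sigmoid(w * x), w)  # = 0.2 * x inside (-2.5, 2.5), else 0

f = theano.function([x], [g_exact, g_hard])
print(f(1))   # both nonzero: roughly 0.20 and 0.2
print(f(3))   # exact is small but nonzero (~0.135), hard is exactly 0.0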
OP EDIT:
ultra_fast_sigmoid fails, if you look at the source code implementation, because it is hard coded in Python rather than being handled by the tensor expressions.
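For completeness, the MethodNotDefined error from the question can be reproduced in isolation (a small sketch of my own, not from the post): asking for any gradient through ultra_fast_sigmoid fails at graph-construction time, because the underlying scalar op does not implement grad.

import theano
import theano.tensor as T
import theano.tensor.nnet as nnet

x = T.dscalar()
y = nnet.ultra_fast_sigmoid(x)

try:
    theano.grad(y, x)
except Exception as e:
    # expected: MethodNotDefined for UltraFastScalarSigmoid, as in the traceback above
    print(e)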