Computing the gradients of the new state (of the RNN) with respect to model parameters (including the CNN for the inputs) in TensorFlow; tf.gradients returns None

Question

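Summary:

I have a 1D CNN that extracts features from the inputs. The CNN is followed by an RNN. I am looking for a way to backpropagate gradients from the new_state of the RNN to the CNN parameters. Alternatively, we can consider a conv layer with a kernel size of [1, 1, input_num_features, output_num_features]. The code is as follows: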

import tensorflow as tf

mseed = 123
tf.set_random_seed(mseed)
kernel_initializer = tf.glorot_normal_initializer(seed=mseed)

# Graph Hyperparameters
cell_size = 64
num_classes = 2
m_dtype = tf.float32
num_features = 30

inputs_train_ph = tf.placeholder(dtype=m_dtype, shape=[None, 75, num_features], name="inputs_train_ph")
inputs_devel_ph = tf.placeholder(dtype=m_dtype, shape=[None, 75, num_features], name="inputs_devel_ph")

labels_train_ph = tf.placeholder(dtype=m_dtype, shape=[None, 75, num_classes], name="labels_train_ph")
labels_devel_ph = tf.placeholder(dtype=m_dtype, shape=[None, 75, num_classes], name="labels_devel_ph")

def return_inputs_train(): return inputs_train_ph
def return_inputs_devel(): return inputs_devel_ph
def return_labels_train(): return labels_train_ph
def return_labels_devel(): return labels_devel_ph

phase_train = tf.placeholder(tf.bool, shape=())
dropout = tf.placeholder(dtype=m_dtype, shape=())
initial_state = tf.placeholder(shape=[None, cell_size], dtype=m_dtype, name="initial_state")

inputs = tf.cond(phase_train, return_inputs_train, return_inputs_devel)
labels = tf.cond(phase_train, return_labels_train, return_labels_devel)

# Graph
def model(inputs):
    used = tf.sign(tf.reduce_max(tf.abs(inputs), 2))

    length = tf.reduce_sum(used, 1)
    length = tf.cast(length, tf.int32)

    with tf.variable_scope('layer_cell'):
        inputs = tf.layers.conv1d(inputs, filters=100, kernel_size=3, padding="same",
                                  kernel_initializer=tf.glorot_normal_initializer(seed=mseed))
        inputs = tf.layers.batch_normalization(inputs, training=phase_train, name="bn")
        inputs = tf.nn.relu(inputs)

    with tf.variable_scope('lstm_model'):

        cell = tf.nn.rnn_cell.GRUCell(cell_size, kernel_initializer=kernel_initializer)
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=1.0 - dropout, state_keep_prob=1.0 - dropout)
        output, new_state = tf.nn.dynamic_rnn(cell, inputs, dtype=m_dtype, sequence_length=length,
                                              initial_state=initial_state)

    with tf.variable_scope("output"):
        output = tf.reshape(output, shape=[-1, cell_size])
        output = tf.layers.dense(output, units=num_classes,
                                 kernel_initializer=kernel_initializer)

        output = tf.reshape(output, shape=[5, -1, num_classes])
        used = tf.expand_dims(used, 2)

        output = output * used

    return output, new_state


output, new_state = model(inputs)

grads_new_state_wrt_vars = tf.gradients(new_state, tf.trainable_variables())
for g in grads_new_state_wrt_vars:
    print('**', g)

init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)

Note that when I print the gradient tensors, I get the following:

for g in grads_new_state_wrt_vars:
    print('**', g)

** None
** None
** None
** None
** Tensor("gradients/model/lstm_model/rnn/while/gru_cell/MatMul/Enter_grad/b_acc_3:0", shape=(220, 240), dtype=float64)
** Tensor("gradients/model/lstm_model/rnn/while/gru_cell/BiasAdd/Enter_grad/b_acc_3:0", shape=(240,), dtype=float64)
** Tensor("gradients/model/lstm_model/rnn/while/gru_cell/MatMul_1/Enter_grad/b_acc_3:0", shape=(220, 120), dtype=float64)
** Tensor("gradients/model/lstm_model/rnn/while/gru_cell/BiasAdd_1/Enter_grad/b_acc_3:0", shape=(120,), dtype=float64)
** None
** None

Finally, the weights in the network are printed as follows:

for v in tf.trainable_variables():
    print(v.name)

model/conv1d/kernel:0
model/conv1d/bias:0
model/bn/gamma:0
model/bn/beta:0
model/lstm_model/rnn/gru_cell/gates/kernel:0
model/lstm_model/rnn/gru_cell/gates/bias:0
model/lstm_model/rnn/gru_cell/candidate/kernel:0
model/lstm_model/rnn/gru_cell/candidate/bias:0
model/output/dense/kernel:0
model/output/dense/bias:0

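So why can't the gradients be computed with respect to the weights of the first conv and batch norm layers in the network?

Note that I do not run into the same problem when I replace new_state with output in tf.gradients(new_state, tf.trainable_variables()).

Any help is much appreciated!

Edit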

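I found that if I replace the None in the placeholders defined above with a fixed batch size, the problem is solved: I get the gradients of new_state w.r.t. the conv layers. This works as long as the batch size defined in the train and devel placeholders is the same, for example: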

inputs_train_ph = tf.placeholder(dtype=m_dtype, shape=[34, 75, num_features], name="inputs_train_ph")
inputs_devel_ph = tf.placeholder(dtype=m_dtype, shape=[34, 75, num_features], name="inputs_devel_ph")

labels_train_ph = tf.placeholder(dtype=m_dtype, shape=[34, 75, num_classes], name="labels_train_ph")
labels_devel_ph = tf.placeholder(dtype=m_dtype, shape=[34, 75, num_classes], name="labels_devel_ph")

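Otherwise, I run into the error again.

Note that the gradients of output w.r.t. the conv layers are not affected by the None batch size in the placeholders defined above.

Now I would like to know why I get this error if I do not change None to a batch size.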

Answer 1

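I am not sure whether this is an exact answer to your question, but it may be a good start, since your question does not seem to have drawn much attention (and, to be honest, I am not surprised, given how much confusion there is).

Besides, others may have plenty of questions as well (too many for comments), so I will raise them here (and try to provide an answer for the case at hand).

TensorFlow version: 1.12

1. None gradients

When batch_size is unspecified, this problem occurs for neither output nor new_state.

Gradients w.r.t. new_state, e.g. grads_new_state_wrt_vars = tf.gradients(new_state, tf.trainable_variables()), return: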

** Tensor("gradients/layer_cell/conv1d/BiasAdd_grad/BiasAddGrad:0", shape=(100,), dtype=float32)
** Tensor("gradients/layer_cell/bn/batchnorm/mul_grad/Mul_1:0", shape=(100,), dtype=float32)
** Tensor("gradients/layer_cell/bn/batchnorm/add_1_grad/Reshape_1:0", shape=(100,), dtype=float32)
** Tensor("gradients/lstm_model/rnn/while/gru_cell/MatMul/Enter_grad/b_acc_3:0", shape=(164, 128), dtype=float32)
** Tensor("gradients/lstm_model/rnn/while/gru_cell/BiasAdd/Enter_grad/b_acc_3:0", shape=(128,), dtype=float32)
** Tensor("gradients/lstm_model/rnn/while/gru_cell/MatMul_1/Enter_grad/b_acc_3:0", shape=(164, 64), dtype=float32)
** Tensor("gradients/lstm_model/rnn/while/gru_cell/BiasAdd_1/Enter_grad/b_acc_3:0", shape=(64,), dtype=float32)
** None
** None

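This is as expected (since new_state does not pass through the Dense part of the network, the two Dense variables get None).

Gradients w.r.t. output_state, e.g. grads_new_state_wrt_vars = tf.gradients(output_state, tf.trainable_variables()), return: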

** Tensor("gradients/layer_cell/conv1d/BiasAdd_grad/BiasAddGrad:0", shape=(100,), dtype=float32)                                                                                           
** Tensor("gradients/layer_cell/bn/batchnorm/mul_grad/Mul_1:0", shape=(100,), dtype=float32)
** Tensor("gradients/layer_cell/bn/batchnorm/add_1_grad/Reshape_1:0", shape=(100,), dtype=float32)                                                                                         
** Tensor("gradients/lstm_model/rnn/while/gru_cell/MatMul/Enter_grad/b_acc_3:0", shape=(164, 128), dtype=float32)                                                                          
** Tensor("gradients/lstm_model/rnn/while/gru_cell/BiasAdd/Enter_grad/b_acc_3:0", shape=(128,), dtype=float32)                                                                             
** Tensor("gradients/lstm_model/rnn/while/gru_cell/MatMul_1/Enter_grad/b_acc_3:0", shape=(164, 64), dtype=float32)                                                                         
** Tensor("gradients/lstm_model/rnn/while/gru_cell/BiasAdd_1/Enter_grad/b_acc_3:0", shape=(64,), dtype=float32)                                                                            
** Tensor("gradients/output/dense/MatMul_grad/MatMul_1:0", shape=(64, 2), dtype=float32)
** Tensor("gradients/output/dense/BiasAdd_grad/BiasAddGrad:0", shape=(2,), dtype=float32)

Once again, everything is fine.

2. Specifying batch_size

When the batch size is specified as you describe, e.g.
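To make the "as expected" part concrete, here is a toy sketch in plain Python (not TensorFlow's actual implementation) of why a gradient comes back as None: the target simply does not depend on that variable. The graph and node names below are simplified assumptions modelled on the network in the question.

```python
# Toy dependency graph: each node maps to the nodes it is computed from.
# Node names are illustrative stand-ins, not real TensorFlow op names.
GRAPH = {
    "conv_out": ["inputs", "conv1d/kernel", "conv1d/bias"],
    "bn_out": ["conv_out", "bn/gamma", "bn/beta"],
    "rnn": ["bn_out", "gru/gates/kernel", "gru/gates/bias",
            "gru/candidate/kernel", "gru/candidate/bias"],
    "new_state": ["rnn"],
    "output": ["rnn", "dense/kernel", "dense/bias"],
}

VARIABLES = [
    "conv1d/kernel", "conv1d/bias", "bn/gamma", "bn/beta",
    "gru/gates/kernel", "gru/gates/bias",
    "gru/candidate/kernel", "gru/candidate/bias",
    "dense/kernel", "dense/bias",
]


def ancestors(graph, target):
    """Return every node that `target` depends on, directly or indirectly."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def gradient_exists(graph, target, variables):
    """Map each variable to True if `target` depends on it, else False."""
    reachable = ancestors(graph, target)
    return {name: name in reachable for name in variables}


for target in ("new_state", "output"):
    missing = [v for v, ok in gradient_exists(GRAPH, target, VARIABLES).items() if not ok]
    print(target, "-> no gradient for:", missing)
```

In this toy graph only the two dense variables are unreachable from new_state, which mirrors the two trailing None entries in the first listing.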

inputs_train_ph = tf.placeholder(dtype=m_dtype, shape=[34, 75, num_features], name="inputs_train_ph")
inputs_devel_ph = tf.placeholder(dtype=m_dtype, shape=[34, 75, num_features], name="inputs_devel_ph")

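your network does not work at all, which is not surprising, because your shapes do not match in the "output" variable scope.

Furthermore, labels_devel_ph and labels_train_ph are irrelevant to the code in question, so why put them here? They only make the question more confusing. Please see Minimal, Complete and Verifiable Example; many parts are completely unnecessary for the task in question.

Secondly, there is no connection between the batch shapes of inputs_train_ph and inputs_devel_ph, and why would there be? One is independent of the other, and because of tf.cond only one is used at a time (both have to be fed as values in the session run, but that is beyond the scope of this question).

The problematic output part is exactly this: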

output = tf.reshape(output, shape=[5, -1, num_classes])
used = tf.expand_dims(used, 2)

output = output * used

Tensor shapes using your approach (here with a batch size of 32):

Output shape: (5, 480, 2)                                                                                                                          
Used initial shape: (32, 75) 
Used after expand_dims: (32, 75, 1)

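Obviously, multiplying (5, 480, 2) by (32, 75, 1) cannot work, and I see no way it ever could have, even back when TensorFlow was in pre-alpha. This makes me think some other part of your source code makes it work, and honestly, who knows what else affects it.

The problem with used can be solved in many ways, but I think what you wanted is to stack used along another dimension and then reshape it (it will not work without the reshape):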
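For reference, element-wise multiplication in TensorFlow follows NumPy-style broadcasting. The following is a minimal plain-Python sketch of that rule (the helper name broadcast_shape is made up for illustration), applied to the two shapes above:

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Compare dimensions from the trailing axis backwards; a missing axis counts as 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError("cannot broadcast {} with {}".format(shape_a, shape_b))
    return tuple(reversed(result))


# The mask broadcasts fine against an output that keeps the (batch, time) layout.
print(broadcast_shape((32, 75, 2), (32, 75, 1)))

# After reshaping the output to (5, 480, 2) the middle axes no longer line up.
try:
    broadcast_shape((5, 480, 2), (32, 75, 1))
except ValueError as err:
    print("error:", err)
```

The second call fails on the middle axis (480 vs 75), which is the mismatch described above.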

output = tf.reshape(output, shape=[5, -1, num_classes])
used = tf.stack((used, used), 2)
used = tf.reshape(used, shape=(5, -1, num_classes))

output = output * used

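With this approach, every batch shape passes through the network without problems.

BTW, I am also not entirely sure what you were trying to achieve with used in the first place, though maybe it has its use case. IMO the intent is completely unreadable compared to the end result (an entry is zero when all features at that step of the sequence are zero, and one otherwise).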
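As a quick sanity check of that claim, here is a small plain-Python sketch that resolves the -1 in both reshapes for a few batch sizes. The helper resolve_reshape is hypothetical and only mimics how a single -1 is inferred; the sequence length of 75 and num_classes of 2 are taken from the question.

```python
def resolve_reshape(shape, new_shape):
    """Fill in a single -1 in new_shape so the element count matches shape."""
    total = 1
    for dim in shape:
        total *= dim
    known = 1
    for dim in new_shape:
        if dim != -1:
            known *= dim
    if new_shape.count(-1) != 1 or total % known != 0:
        raise ValueError("cannot reshape {} into {}".format(shape, new_shape))
    return tuple(total // known if dim == -1 else dim for dim in new_shape)


num_classes = 2
for batch_size in (1, 32, 34):
    dense_out = (batch_size * 75, num_classes)    # shape after the dense layer
    used_stacked = (batch_size, 75, num_classes)  # shape after tf.stack((used, used), 2)
    out_shape = resolve_reshape(dense_out, (5, -1, num_classes))
    used_shape = resolve_reshape(used_stacked, (5, -1, num_classes))
    print(batch_size, out_shape, used_shape)
```

Both tensors end up with the same shape for every batch size, because 75 is divisible by 5. Note that stacking used exactly twice only lines up because num_classes is 2 here.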
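For readers puzzled by the same lines, this is roughly what used and length compute, sketched in plain Python on nested lists instead of tensors (the function names are made up for illustration):

```python
def used_mask(batch):
    """Mimic tf.sign(tf.reduce_max(tf.abs(inputs), 2)).

    batch is a list of sequences; each sequence is a list of time steps;
    each time step is a list of features. The result holds 1.0 for steps
    with at least one non-zero feature and 0.0 for all-zero (padding) steps.
    """
    return [
        [1.0 if max(abs(x) for x in step) > 0 else 0.0 for step in sequence]
        for sequence in batch
    ]


def sequence_lengths(mask):
    """Mimic tf.cast(tf.reduce_sum(used, 1), tf.int32)."""
    return [int(sum(row)) for row in mask]


batch = [
    [[0.5, -1.0], [0.0, 2.0], [0.0, 0.0]],  # two real steps, one padding step
    [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]],   # one real step, two padding steps
]
mask = used_mask(batch)
print(mask)                    # [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(sequence_lengths(mask))  # [2, 1]
```

The length only equals the true sequence length if all-zero steps occur solely as trailing padding; that is an assumption of this trick, not something the code checks.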

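3. TensorFlow version

Testing with TensorFlow version 1.8 (the current version is 1.12, and 2.0 is coming), I was able to reproduce your problem.

Still, the output shapes crash as described above.

In fact, version 1.9 already fixes this problem.

Why does this problem exist in 1.8?

I have tried to find the possible cause, but I still do not know. Some ideas: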

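Using tf.layers instead of tf.nn. Since this module is planned to be deprecated in version 2.0, I thought it might be causing problems in this case as well. I changed conv1d to its tf.keras.layers counterpart and batch normalization to tf.nn.batch_normalization, unfortunately to no avail; the results were the same.
The 1.9 release notes mention: "Prevent tf.gradients() from backpropagating through integer tensors". Maybe this is related to your problem? Maybe the backpropagation graph somehow gets stuck in v1.8.0?
Scoping issues with tf.variable_scope and tf.layers. As expected, nothing changed after removing the scopes.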

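To sum up: update your dependency to version 1.9 or above; it solves all of your problems (although why they occur is hard to grasp, as is this framework).

Actually, I do not understand why such changes are made between minor versions without a more in-depth description in the changelog, but maybe I am missing something important...

A bit off-topic

Maybe you should consider using tf.Estimator, tf.keras, or another framework such as PyTorch? This code is extremely hard to read, hard to debug, and ugly. Newer practices exist, and they should help both you and us (should you have other problems with your neural network and decide to ask on Stack Overflow).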
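If you want the script to fail early on an affected installation, a guard along these lines may help. It is only a sketch: the helper name version_at_least is made up, and the 1.9 threshold comes from my tests above rather than from any documented guarantee.

```python
def version_at_least(version, minimum=(1, 9)):
    """Compare the leading major.minor of a version string against `minimum`."""
    parts = version.split(".")
    major_minor = tuple(int(part) for part in parts[:2])
    return major_minor >= minimum


# In a real script the string would come from tf.__version__;
# literal strings are used here so the sketch runs without TensorFlow.
for version in ("1.8.0", "1.9.0", "1.12.0"):
    print(version, version_at_least(version))
```

Comparing integer tuples rather than raw strings matters here, since as strings "1.12" would sort before "1.9".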


Tags: python, gradient, conv-neural-network, tensorflow, recurrent-neural-network
