How to visualize RNN/LSTM gradients in Keras/TensorFlow?

Question

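I've come across research publications and Q&As discussing the need to inspect RNN gradients per backpropagation through time (BPTT), i.e., the gradient at each timestep. The main use is introspection: how do we know whether an RNN is learning long-term dependencies? That is a topic of its own, but the most important insight is gradient flow: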

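If a non-zero gradient flows through every timestep, then every timestep contributes to learning; i.e., the resultant gradients stem from accounting for every input timestep, so the entire sequence influences weight updates
Per the above, an RNN no longer ignores portions of long sequences and is forced to learn from them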
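
To make "gradient flow" concrete, here is a minimal sketch of BPTT on a toy one-unit tanh RNN, written in plain Python. The model, the weights `w` and `u`, and the choice of loss (the last hidden state) are assumptions made purely for illustration; they are not part of the question.

```python
import math

def hidden_state_grads(xs, w=0.5, u=1.0):
    """Return |dL/dh_t| for every timestep of a one-unit tanh RNN.

    Model: h_t = tanh(w * h_{t-1} + u * x_t), with loss L = h_T (last state).
    """
    # Forward pass: store every hidden state
    hs, h = [], 0.0
    for x in xs:
        h = math.tanh(w * h + u * x)
        hs.append(h)

    # Backward pass (BPTT): dL/dh_T = 1, then apply the chain rule backwards,
    # dL/dh_{t-1} = dL/dh_t * (1 - h_t^2) * w
    grads = [0.0] * len(xs)
    grad = 1.0
    for t in reversed(range(len(xs))):
        grads[t] = abs(grad)
        grad *= (1.0 - hs[t] ** 2) * w
    return grads

grads = hidden_state_grads([0.5] * 20)
for t in (0, 9, 19):
    print("t=%2d  |dL/dh_t| = %.2e" % (t, grads[t]))
```

In this toy setup every backward step multiplies the gradient by a factor smaller than 1, so the earliest timesteps receive a far smaller gradient than the latest ones; that per-timestep profile is the quantity the question asks to visualize.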

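...but how do I actually visualize these gradients in Keras/TensorFlow? Some related answers are on the right track, but they seem to fail for bidirectional RNNs, and they only show how to get a layer's gradients, not how to meaningfully visualize them (the output is a 3D tensor; how do I plot it?)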

Answer 1

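Gradients can be fetched w.r.t. weights or outputs; we'll need the latter. Further, for best results, an architecture-specific treatment is desired. The code and explanations below cover every possible case of a Keras/TF RNN and should be easily expandable to any future API changes.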

Completeness: the code shown is a simplified version; the full version can be found in my repository, See RNN (this post includes bigger images). Included are:

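Greater visual customizability
Docstrings explaining all functionality
Support for Eager, Graph, TF1, TF2, and both from keras & from tf.keras
Activations visualization
Weights gradients visualization (coming soon)
Weights visualization (coming soon)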

I/O dimensionalities (all RNNs):

Input: (batch_size, timesteps, channels) - or, equivalently, (samples, timesteps, features)
Output: same as Input, except:
- channels/features is now the # of RNN units, and:
- return_sequences=True --> timesteps_out = timesteps_in (output a prediction for each input timestep)
- return_sequences=False --> timesteps_out = 1 (output a prediction only at the last timestep processed)
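
The shape rules above can be sketched as a small helper. `rnn_output_shape` is a hypothetical function written for illustration, not a Keras API. It assumes a bidirectional wrapper that concatenates both directions, and it reflects that Keras drops the time axis altogether when return_sequences=False rather than keeping an axis of length 1.

```python
def rnn_output_shape(input_shape, units, return_sequences=True, bidirectional=False):
    """Expected output shape of a Keras RNN layer (hypothetical helper)."""
    batch_size, timesteps, _ = input_shape
    # A concatenating bidirectional wrapper doubles the number of channels
    channels = 2 * units if bidirectional else units
    if return_sequences:
        return (batch_size, timesteps, channels)
    # Only the last processed timestep is returned; the time axis is dropped
    return (batch_size, channels)

print(rnn_output_shape((16, 100, 4), units=6))
print(rnn_output_shape((16, 100, 4), units=6, return_sequences=False))
print(rnn_output_shape((16, 100, 4), units=256, bidirectional=True))
```

The gradients w.r.t. a layer's outputs have the same shape as those outputs, which is why the 1D and 2D plots need return_sequences=True while the 0D plot is used for return_sequences=False.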

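Visualization methods:

1D plot grid: plot gradient vs. timesteps for each of the channels
2D heatmap: plot channels vs. timesteps w/ gradient intensity heatmap
0D aligned scatter: plot gradient for each channel per sample
~~histogram~~: no good way to represent "vs. timesteps" relations
One sample: do each of the above for a single sample
Entire batch: do each of the above for all samples in a batch; requires careful treatment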

# for below examples
grads = get_rnn_gradients(model, x, y, layer_idx=1) # return_sequences=True
grads = get_rnn_gradients(model, x, y, layer_idx=2) # return_sequences=False

EX 1: one sample, uni-LSTM, 6 units -- return_sequences=True, trained for 20 iterations
show_features_1D(grads[0], n_rows=2)

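Note: gradients are to be read right-to-left, as they're computed (from last timestep to first)
Rightmost (latest) timesteps consistently have a higher gradient
Vanishing gradient: ~75% of leftmost timesteps have a zero gradient, indicating poor time dependency learning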
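
A figure like the ~75% above can also be estimated numerically rather than read off the plot. The following is a minimal sketch; `vanished_fraction` and its tolerance `tol` are hypothetical names and values chosen for illustration, and the toy data stands in for one sample's gradients.

```python
def vanished_fraction(grads, tol=1e-6):
    """Fraction of timesteps whose gradient is ~0 across all channels.

    `grads` is one sample's gradients as a (timesteps, channels) nested
    sequence, e.g. a 2D array sliced from a batch of gradients.
    """
    n_vanished = sum(
        1 for step in grads if all(abs(g) <= tol for g in step)
    )
    return n_vanished / len(grads)

# Toy data: 8 timesteps, 2 channels; only the last 2 timesteps carry gradient
toy = [[0.0, 0.0]] * 6 + [[0.01, -0.02], [0.03, 0.05]]
print(vanished_fraction(toy))
```

What counts as "zero" depends on the gradient scale of the model at hand, so the tolerance should be set relative to the largest gradients observed.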

EX 2: all (16) samples, uni-LSTM, 6 units -- return_sequences=True, trained for 20 iterations
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))

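Each sample is shown in a different color (but the same color per sample across channels)
Some samples perform better than the one shown above, but not by much
The heatmap plots channels (y-axis) vs. timesteps (x-axis); blue=-0.01, red=0.01, white=0 (gradient values)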
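
The effect of the norm=(-.01, .01) argument can be sketched as a linear mapping of each gradient value to a position on the colormap. This is an illustrative reimplementation in plain Python, not the plotting library's own code; with a blue-white-red colormap, position 0 is blue, 0.5 is white and 1 is red.

```python
def color_position(value, vmin=-0.01, vmax=0.01):
    """Map a gradient value to [0, 1] the way a linear color norm would."""
    position = (value - vmin) / (vmax - vmin)
    return min(max(position, 0.0), 1.0)  # out-of-range values saturate

for value in (-0.01, 0.0, 0.01, 0.05):
    print(value, color_position(value))
```

Any gradient beyond the chosen limits saturates at the end colors, so limits that are too tight hide differences among large gradients, while limits that are too wide wash small gradients out to white.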

EX 3: all (16) samples, uni-LSTM, 6 units -- return_sequences=True, trained for 200 iterations
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))

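Both plots show the LSTM performing clearly better after 180 additional iterations
The gradient still vanishes for about half the timesteps
All LSTM units better capture time dependencies of one particular sample (blue curve, all plots), which we can tell from the heatmap to be the first sample. We can plot that sample vs. other samples to try to understand the difference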

EX 4: 2D vs. 1D, uni-LSTM: 256 units, return_sequences=True, trained for 200 iterations
show_features_1D(grads[0])
show_features_2D(grads[:, :, 0], norm=(-.0001, .0001))

2D is better suited for comparing many channels across few samples
1D is better suited for comparing many samples across a few channels

EX 5: bi-GRU, 256 units (512 total) -- return_sequences=True, trained for 400 iterations
show_features_2D(grads[0], norm=(-.0001, .0001), reflect_half=True)

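The backward layer's gradients are flipped for consistency w.r.t. the time axis
The plot reveals a lesser-known advantage of Bi-RNNs - information utility: the collective gradient covers about twice the data. However, this isn't a free lunch: each layer is an independent feature extractor, so learning isn't really complemented
A lower norm for more units is expected, as approx. the same loss-derived gradient is being distributed across more parameters (hence the squared numeric average is less)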
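
Flipping the backward layer's gradients along the time axis might look like the sketch below. `flip_backward_half` is a hypothetical helper, not the repository's implementation; it assumes the channel axis holds the forward layer's units first and the backward layer's units second, as a concatenating bidirectional wrapper would produce.

```python
def flip_backward_half(grads):
    """Reverse the time axis of the second (backward) half of the channels.

    `grads` is one sample's gradients as a (timesteps, channels) nested list.
    """
    half = len(grads[0]) // 2
    reversed_steps = grads[::-1]
    # Forward channels stay in place; backward channels come from the
    # timestep mirrored around the middle of the sequence
    return [
        list(step[:half]) + list(rev_step[half:])
        for step, rev_step in zip(grads, reversed_steps)
    ]

# Toy data: 3 timesteps, 1 forward channel and 1 backward channel
toy = [[1, 10], [2, 20], [3, 30]]
print(flip_backward_half(toy))
```

With a NumPy array the same operation is a reversed slice along the time axis applied to the second half of the channels.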

EX 6: 0D, all (16) samples, uni-LSTM, 6 units -- return_sequences=False, trained for 200 iterations
show_features_0D(grads)

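return_sequences=False utilizes only the last timestep's gradient (which is still derived from all timesteps, unless using truncated BPTT), requiring a new approach
The plot color-codes each RNN unit consistently across samples for comparison (one color can be used instead)
Evaluating gradient flow is less direct and more theoretically involved. One simple approach is to compare distributions at the beginning of training and later on: if the difference isn't significant, the RNN does poorly in learning long-term dependencies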

EX 7: LSTM vs. GRU vs. SimpleRNN, unidir, 256 units -- return_sequences=True, trained for 250 iterations
show_features_2D(grads, n_rows=8, norm=(-.0001, .0001), show_xy_ticks=[0,0], show_title=False)

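Note: the comparison isn't very meaningful; each network thrives w/ different hyperparameters, whereas the same ones were used for all. LSTM, for one, bears the most parameters per unit, drowning out SimpleRNN
In this setup, LSTM definitively stomps GRU and SimpleRNN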

Visualization functions:

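import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from tensorflow.keras import backend as K  # or: from keras import backend as K

def get_rnn_gradients(model, input_data, labels, layer_idx=None, layer_name=None, 
                      sample_weights=None):
    # Look up the target layer by index or by (partial) name
    layer = _get_layer(model, layer_idx, layer_name)

    grads_fn = _make_grads_fn(model, layer)
    # `or` is ambiguous for arrays, so compare against None explicitly
    if sample_weights is None:
        sample_weights = np.ones(len(input_data))
    # Trailing 1 is the learning-phase flag (1 = train mode)
    grads = grads_fn([input_data, sample_weights, labels, 1])

    # K.function returns a list; unwrap it down to the gradient array
    while type(grads) == list:
        grads = grads[0]
    return grads

def _make_grads_fn(model, layer):
    # Relies on graph-mode Keras internals; the model must be compiled first
    grads = model.optimizer.get_gradients(model.total_loss, layer.output)
    return K.function(inputs=[model.inputs[0],  model.sample_weights[0],
                              model._feed_targets[0], K.learning_phase()], outputs=grads) 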

def _get_layer(model, layer_idx=None, layer_name=None):
    if layer_idx is not None:
        return model.layers[layer_idx]

    layer = [layer for layer in model.layers if layer_name in layer.name]
    if len(layer) > 1:
        print("WARNING: multiple matching layer names found; "
              + "picking earliest")
    return layer[0]


def show_features_1D(data, n_rows=None, label_channels=True,
                     equate_axes=True, max_timesteps=None, color=None,
                     show_title=True, show_borders=True, show_xy_ticks=[1,1], 
                     title_fontsize=14, channel_axis=-1, 
                     scale_width=1, scale_height=1, dpi=76):
    def _get_title(data, show_title):
        if len(data.shape)==3:
            return "((Gradients vs. Timesteps) vs. Samples) vs. Channels"
        else:
            return "(Gradients vs. Timesteps) vs. Channels"

    def _get_feature_outputs(data, subplot_idx):
        # `subplot_idx` is the zero-based channel index (matches the axis label)
        if len(data.shape)==3:
            feature_outputs = []
            for entry in data:
                feature_outputs.append(entry[:, subplot_idx][:max_timesteps])
            return feature_outputs
        else:
            return [data[:, subplot_idx][:max_timesteps]]

    if len(data.shape)!=2 and len(data.shape)!=3:
        raise Exception("`data` must be 2D or 3D")

    if len(data.shape)==3:
        n_features = data[0].shape[channel_axis]
    else:
        n_features = data.shape[channel_axis]
    n_rows = n_rows or 1  # default to a single row of subplots
    n_cols = int(n_features / n_rows)

    if color is None:
        n_colors = len(data) if len(data.shape)==3 else 1
        color = [None] * n_colors

    fig, axes = plt.subplots(n_rows, n_cols, sharey=equate_axes, dpi=dpi)
    axes = np.asarray(axes)

    if show_title:
        title = _get_title(data, show_title)
        plt.suptitle(title, weight='bold', fontsize=title_fontsize)
    fig.set_size_inches(12*scale_width, 8*scale_height)

    for ax_idx, ax in enumerate(axes.flat):
        feature_outputs = _get_feature_outputs(data, ax_idx)
        for idx, feature_output in enumerate(feature_outputs):
            ax.plot(feature_output, color=color[idx])

        ax.axis(xmin=0, xmax=len(feature_outputs[0]))
        if not show_xy_ticks[0]:
            ax.set_xticks([])
        if not show_xy_ticks[1]:
            ax.set_yticks([])
        if label_channels:
            ax.annotate(str(ax_idx), weight='bold',
                        color='g', xycoords='axes fraction',
                        fontsize=16, xy=(.03, .9))
        if not show_borders:
            ax.set_frame_on(False)

    if equate_axes:
        # `axes.flat` handles 1x1, single-row and multi-row grids alike
        y_new = np.max(np.abs([ax.get_ylim() for ax in axes.flat]))
        for ax in axes.flat:
            ax.set_ylim(-y_new, y_new)
    plt.show()


def show_features_2D(data, n_rows=None, norm=None, cmap='bwr', reflect_half=False,
                     timesteps_xaxis=True, max_timesteps=None, show_title=True,
                     show_colorbar=False, show_borders=True, 
                     title_fontsize=14, show_xy_ticks=[1,1],
                     scale_width=1, scale_height=1, dpi=76):
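    def _get_title(data, show_title, timesteps_xaxis, vmin, vmax):
        # Axes are named in drawing order: (y-axis vs. x-axis)
        if timesteps_xaxis:
            context_order = "(Channels vs. Timesteps)"
        else:
            context_order = "(Timesteps vs. Channels)"
        extra_dim = " vs. Samples" if len(data) > 1 else ""
        return "{}{} -- norm=({}, {})".format(context_order, extra_dim, vmin, vmax)

    vmin, vmax = norm or (None, None)
    data = np.asarray(data)
    if len(data.shape)==2:
        data = data[None]  # treat a single sample as a batch of one
    if timesteps_xaxis:
        data = data.transpose(0, 2, 1)  # (samples, channels, timesteps)
    n_samples = len(data)
    n_rows = n_rows or 1  # default to a single row of subplots
    n_cols = int(n_samples / n_rows)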

    fig, axes = plt.subplots(n_rows, n_cols, dpi=dpi)
    axes = np.asarray(axes)

    if show_title:
        title = _get_title(data, show_title, timesteps_xaxis, vmin, vmax)
        plt.suptitle(title, weight='bold', fontsize=title_fontsize)

    for ax_idx, ax in enumerate(axes.flat):
        img = ax.imshow(data[ax_idx], cmap=cmap, vmin=vmin, vmax=vmax)
        if not show_xy_ticks[0]:
            ax.set_xticks([])
        if not show_xy_ticks[1]:
            ax.set_yticks([])
        ax.axis('tight')
        if not show_borders:
            ax.set_frame_on(False)

    if show_colorbar:
        fig.colorbar(img, ax=axes.ravel().tolist())

    plt.gcf().set_size_inches(8*scale_width, 8*scale_height)
    plt.show()


def show_features_0D(data, marker='o', cmap='bwr', color=None,
                     show_y_zero=True, show_borders=False, show_title=True,
                     title_fontsize=14, markersize=15, markerwidth=2,
                     channel_axis=-1, scale_width=1, scale_height=1):
    if color is None:
        cmap = plt.get_cmap(cmap)  # `cm.get_cmap` is gone in newer Matplotlib releases
        cmap_grad = np.linspace(0, 256, len(data[0])).astype('int32')
        color = cmap(cmap_grad)
        color = np.vstack([color] * data.shape[0])
    x = np.ones(data.shape) * np.expand_dims(np.arange(1, len(data) + 1), -1)

    if show_y_zero:
        plt.axhline(0, color='k', linewidth=1)
    plt.scatter(x.flatten(), data.flatten(), marker=marker,
                s=markersize, linewidth=markerwidth, color=color)
    plt.gca().set_xticks(np.arange(1, len(data) + 1), minor=True)
    plt.gca().tick_params(which='minor', length=4)

    if show_title:
        plt.title("(Gradients vs. Samples) vs. Channels",
                  weight='bold', fontsize=title_fontsize)
    if not show_borders:
        plt.box(None)
    plt.gcf().set_size_inches(12*scale_width, 4*scale_height)
    plt.show()

Full minimal example: see the repository's README

Bonus code:

How can I check weight/gate ordering without reading the source code?

rnn_cell = model.layers[1].cell          # unidirectional
rnn_cell = model.layers[1].forward_layer # bidirectional; also `backward_layer`
print(rnn_cell.__dict__)

For more convenient code, see the repo's rnn_summary

Bonus fact: if you run the above on a GRU, you may notice that bias has no gates; why so? From the docs:

There are two variants. The default one is based on 1406.1078v3 and has reset gate applied to hidden state before matrix multiplication. The other one is based on original 1406.1078v1 and has the order reversed.

The second variant is compatible with CuDNNGRU (GPU-only) and allows inference on CPU. Thus it has separate biases for kernel and recurrent_kernel. Use 'reset_after'=True and recurrent_activation='sigmoid'.
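
The weight layout described above can be summarized in a small sketch. `expected_weight_shapes` is a hypothetical helper written for illustration, not a Keras function; it assumes the usual Keras convention of stacking the gate blocks along the last axis of each weight (4 blocks for LSTM, 3 for GRU, 1 for SimpleRNN).

```python
def expected_weight_shapes(cell_type, input_dim, units, reset_after=False):
    """Expected kernel / recurrent_kernel / bias shapes (hypothetical helper)."""
    n_gates = {"LSTM": 4, "GRU": 3, "SimpleRNN": 1}[cell_type]
    shapes = {
        "kernel": (input_dim, n_gates * units),
        "recurrent_kernel": (units, n_gates * units),
        "bias": (n_gates * units,),
    }
    if cell_type == "GRU" and reset_after:
        # Separate biases for kernel and recurrent_kernel
        shapes["bias"] = (2, n_gates * units)
    return shapes

print(expected_weight_shapes("LSTM", input_dim=4, units=6))
print(expected_weight_shapes("GRU", input_dim=4, units=6, reset_after=True))
```

Comparing these expected shapes against the shapes of the layer's actual weights is a quick way to confirm which GRU variant a given model uses.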
