Need to implement Deep Learning architecture quite similar to Siamese Network

I have to implement this network:

It is similar to a Siamese network with a contrastive loss. My problem is with S1/F1. The paper says:

"F1 and S1 are neural networks that we use to learn the unit-normalized embeddings for the face and speech modalities, respectively. In Figure 1, we depict F1 and S1 in both training and testing routines. They are composed of 2D convolutional layers (purple), max-pooling layers (yellow), and fully connected layers (green). ReLU non-linearity is used between all layers. The last layer is a unit-normalization layer (blue). For both face and speech modalities, F1 and S1 return 250-dimensional unit-normalized embeddings".

My questions are:

  1. How do I apply 2D convolutional layers (purple) to an input of shape (number of videos, number of frames, features)?
  2. What is the last layer? Batch norm? F.normalize?

I'll answer both of your questions in turn:

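  1. If you are using a CNN, you most likely have spatial information in your input, i.e. your input is a two-dimensional multi-channel tensor (*, channels, height, width), not a feature vector (*, features). If you do not keep that two-dimensional structure, you simply cannot apply a convolution (at least not a 2D convolution) to the input. With an input of shape (number of videos, number of frames, features), one option is to add a singleton channel dimension and treat the frame and feature axes as height and width, giving (number of videos, 1, number of frames, features). Whether that matches the input the paper intends is worth checking against its data description.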

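  2. The last layer is described as a "unit-normalization" layer. This is simply the operation that makes the norm of the vector equal to 1, which you do by dividing the vector by its norm. In PyTorch, F.normalize does exactly this along a chosen dimension. Batch normalization is a different operation: it standardizes each feature using batch statistics and does not give unit-norm vectors.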
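
Both points can be put together in a minimal PyTorch sketch. This is not the architecture from the paper: the class name `EmbeddingNet`, the number of layers, the channel counts, the kernel sizes, and the use of `nn.AdaptiveMaxPool2d` to get a fixed-size feature map are all assumptions made for illustration. Only the 250-dimensional output, the ReLU activations, and the final unit normalization come from the quoted description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingNet(nn.Module):
    """Hypothetical stand-in for F1/S1; layer sizes are illustrative only."""

    def __init__(self, embedding_dim: int = 250):
        super().__init__()
        # 2D convolutions and max pooling over the (frames, features) plane.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # Assumption: adaptive pooling so the fully connected input size
            # does not depend on the number of frames or features.
            nn.AdaptiveMaxPool2d((4, 4)),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 4 * 4, 512),
            nn.ReLU(),
            nn.Linear(512, embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_videos, num_frames, features)
        x = x.unsqueeze(1)          # -> (num_videos, 1, num_frames, features)
        x = self.conv(x)            # -> (num_videos, 32, 4, 4)
        x = x.flatten(start_dim=1)  # -> (num_videos, 512)
        x = self.fc(x)              # -> (num_videos, embedding_dim)
        # Unit normalization: divide each embedding by its L2 norm.
        return F.normalize(x, p=2, dim=1)


torch.manual_seed(0)
batch = torch.randn(8, 30, 64)  # 8 videos, 30 frames, 64 features (made-up sizes)
embeddings = EmbeddingNet()(batch)
```

Here `embeddings` has shape `(8, 250)` and every row has an L2 norm of 1 up to floating-point error, which you can check with `embeddings.norm(dim=1)`.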