Question on the kernel dimensions for convolutions on mel filter bank features

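I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am having trouble with the part that describes how convolutions are applied to the log mel filterbank input features: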

We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters. The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the ’depth’ dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.

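As I understand it, the input to the convolutional layers is 3-dimensional: number of 25 ms windows (T) × 80 (features per window) × 3 (raw features, delta features, and delta-delta features). However, the kernels applied to this input appear to have 4 dimensions, and I don't understand why. Wouldn't a 4-dimensional kernel require a 4-dimensional input? To me, the input has the same dimensions as an RGB image: width (time) × height (frequency) × color channels (raw, delta, and delta-delta features). I would therefore expect the kernel of a 2D convolution to have size a (filter width) × b (filter height) × 3 (depth of input). Am I missing something here? What is wrong with my reasoning, or what does the paper do differently?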

Thanks in advance for your answers!

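I figured it out; it was just a misunderstanding on my part. The authors use 32 kernels per layer, each of shape 3 × 3 × depth (3 in time, 3 in frequency, spanning the full input depth), so "32" is the number of kernels and not an extra kernel dimension. After the two layers with 2 × 2 striding, the output has shape t/4 × 20 × 32, where t is the time dimension.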
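For anyone who wants to check the shape arithmetic, here is a minimal sketch in plain Python. It makes a few assumptions that are not stated in the excerpt: "same" padding (so each strided layer produces ceil(n / 2) outputs per axis, which is what gives 20 frequency bins), a made-up utterance length of `T = 400` frames, and a weight layout of (kernel time, kernel frequency, input depth, number of kernels). The helper names are my own, not from the paper or any library.

```python
import math


def conv2d_weight_shape(in_shape, num_kernels, kernel=(3, 3)):
    """Weight tensor shape: (kernel_time, kernel_freq, input_depth, num_kernels)."""
    return (kernel[0], kernel[1], in_shape[2], num_kernels)


def conv2d_output_shape(in_shape, num_kernels, stride=(2, 2)):
    """Output (time, freq, channels), assuming 'same' padding: ceil(n / stride)."""
    t, f, _depth = in_shape
    return (math.ceil(t / stride[0]), math.ceil(f / stride[1]), num_kernels)


T = 400                           # hypothetical number of 10 ms frames
x = (T, 80, 3)                    # raw, delta, delta-delta stacked along depth

w1 = conv2d_weight_shape(x, 32)   # each of the 32 kernels is 3 x 3 x 3
h1 = conv2d_output_shape(x, 32)   # time and frequency halved
w2 = conv2d_weight_shape(h1, 32)  # depth is now 32, so kernels are 3 x 3 x 32
h2 = conv2d_output_shape(h1, 32)  # time and frequency halved again

print("layer 1 weights:", w1, "output:", h1)
print("layer 2 weights:", w2, "output:", h2)
```

The fourth axis of the weight tensor only stacks the 32 separate kernels; each individual kernel is still 3-dimensional and matches a 3-dimensional input, exactly as with RGB images. Without padding, the frequency axis would shrink to 39 and then 19 instead of 40 and 20, so the t/4 × 20 × 32 result depends on the padding assumption.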